Abstract

We consider the problem of estimating the unconditional distribution of a post-model-selection estimator.
The notion of a post-model-selection estimator here refers to the combined procedure resulting
from first selecting a model (e.g., by a model selection criterion like AIC or by a hypothesis testing 
procedure) and then estimating the parameters in the selected model (e.g., by least-squares or maximum 
likelihood), all based on the same data set. We show that it is impossible to estimate the unconditional 
distribution with reasonable accuracy even asymptotically. In particular, we show that no estimator 
for this distribution can be uniformly consistent (not even locally). This follows as a corollary to (local)
minimax lower bounds on the performance of estimators for the distribution; performance is here
measured by the probability that the estimation error exceeds a given threshold. These lower bounds 
are shown to approach 1/2 or even 1 in large samples, depending on the situation considered. Similar 
impossibility results are also obtained for the distribution of linear functions (e.g., predictors) of the 
post-model-selection estimator. 
1 Introduction and Overview



In many statistical applications a data-based model selection step precedes the final parameter estimation and
inference stage. For example, the specification of the model (choice of functional form, choice of regressors,
number of lags, etc.) is often based on the data. In contrast, the traditional theory of statistical inference
is concerned with the properties of estimators and inference procedures under the central assumption of
an a priori given model. That is, it is assumed that the model is known to the researcher prior to the
statistical analysis, except for the value of the true parameter vector. As a consequence, the actual statistical
properties of estimators or inference procedures following a data-driven model selection step are not described
by the traditional theory which assumes an a priori given model; in fact, they may differ substantially
from the properties predicted by this theory, cf., e.g., Danilov and Magnus (2004), Dijkstra and Veldkamp
(1988), Pötscher (1991, Section 3.3), or Rao and Wu (2001, Section 12). Ignoring the additional uncertainty
originating from the data-driven model selection step and (inappropriately) applying traditional theory can
hence result in very misleading conclusions.

Investigations into the distributional properties of post-model-selection estimators, i.e., of estimators
constructed after a data-driven model selection step, are relatively few and of recent vintage. Sen (1979)
obtained the unconditional large-sample limit distribution of a post-model-selection estimator in an i.i.d.
maximum likelihood framework, when selection is between two competing nested models. In Pötscher (1991)
the asymptotic properties of a class of post-model-selection estimators (based on a sequence of hypothesis
tests) were studied in a rather general setting covering non-linear models, dependent processes, and more
than two competing models. In that paper, the large-sample limit distribution of the post-model-selection
estimator was derived, both unconditional as well as conditional on having chosen a correct model, not
necessarily the minimal one. See also Pötscher and Novak (1998) for further discussion and a simulation
study, and Nickl (2003) for extensions. The finite-sample distribution of a post-model-selection estimator,
both unconditional and conditional on having chosen a particular (possibly incorrect) model, was derived
in Leeb and Pötscher (2003) in a normal linear regression framework; this paper also studied asymptotic
approximations that are in a certain sense superior to the asymptotic distribution derived in Pötscher (1991).
The distributions of corresponding linear predictors constructed after model selection were studied in Leeb
(2005, 2006). Related work can also be found in Sen and Saleh (1987), Kabaila (1995), Pötscher (1995),
Ahmed and Basu (2000), Kapetanios (2001), Hjort and Claeskens (2003), Dukic and Peña (2005), and Leeb
and Pötscher (2005a). The latter paper provides a simple exposition of the problems of inference post model
selection and may serve as an entry point to the present paper.

It transpires from the papers mentioned above that the finite-sample distributions (as well as the
large-sample limit distributions) of post-model-selection estimators typically depend on the unknown model
parameters, often in a complicated fashion. For inference purposes, e.g., for the construction of confidence
sets, estimators of these distributions would be desirable. Consistent estimators of these distributions can
typically be constructed quite easily, e.g., by suitably replacing unknown parameters in the large-sample
limit distributions by estimators; cf. Section 2.2.1. However, the merits of such 'plug-in' estimators in small
samples are questionable: it is known that the convergence of the finite-sample distributions to their
large-sample limits is typically not uniform with respect to the underlying parameters (see Appendix B below and
Corollary 5.5 in Leeb and Pötscher (2003)), and there is no reason to believe that this non-uniformity will
disappear when unknown parameters in the large-sample limit are replaced by estimators. This observation
is the main motivation for the present paper to investigate in general the performance of estimators of the
distribution of a post-model-selection estimator, where the estimators of the distribution are not necessarily
'plug-in' estimators based on the limiting distribution. In particular, we ask whether estimators of the
distribution function of post-model-selection estimators exist that do not suffer from the non-uniformity
phenomenon mentioned above. As we show in this paper, the answer in general is 'No'. We also show that
these negative results extend to the problem of estimating the distribution of linear functions (e.g., linear
predictors) of post-model-selection estimators. Similar negative results apply also to the estimation of the
mean squared error or bias of post-model-selection estimators; cf. Remark 4.7.

To fix ideas consider for the moment the linear regression model

$$ Y = V\chi + W\psi + u \tag{1} $$

where $V$ and $W$, respectively, represent $n \times k$ and $n \times l$ non-stochastic regressor matrices ($k \geq 1$, $l \geq 1$), and
the $n \times 1$ disturbance vector $u$ is normally distributed with mean zero and variance-covariance matrix $\sigma^2 I_n$.
We also assume for the moment that $(V : W)'(V : W)/n$ converges to a non-singular matrix as the sample
size $n$ goes to infinity and that $\lim_{n\to\infty} V'W/n \neq 0$ (for a discussion of the case where this limit is zero see
Example 1 in Section 2.2.2). Now suppose that the vector $\chi$ represents the parameters of interest, while
the parameter vector $\psi$ and the associated regressors in $W$ have been entered into the model only to avoid
possible misspecification. Suppose further that the necessity to include $\psi$ or some of its components is then
checked on the basis of the data, i.e., a model selection procedure is used to determine which components
of $\psi$ are to be retained in the model, the inclusion of $\chi$ not being disputed. The selected model is then used
to obtain the final (post-model-selection) estimator $\tilde\chi$ for $\chi$. We are now interested in the unconditional
finite-sample distribution of $\tilde\chi$ (appropriately scaled and centered). Denote this $k$-dimensional cumulative
distribution function (cdf) by $G_{n,\theta,\sigma}(t)$. As indicated in the notation, this distribution function depends on
the true parameters $\theta = (\chi', \psi')'$ and $\sigma$. For the sake of definiteness of discussion assume for the moment
that the model selection procedure used here is the particular 'general-to-specific' procedure described at
the beginning of Section 2; we comment on other model selection procedures, including Akaike's AIC and
thresholding procedures, below.

As mentioned above, it is not difficult to construct a consistent estimator of $G_{n,\theta,\sigma}(t)$ for any $t$, i.e., an
estimator $\hat G_n(t)$ satisfying

$$ P_{n,\theta,\sigma}\left( \left| \hat G_n(t) - G_{n,\theta,\sigma}(t) \right| > \delta \right) \xrightarrow{\,n\to\infty\,} 0 \tag{2} $$

for each $\delta > 0$ and each $\theta$, $\sigma$; see Section 2.2.1. However, it follows from the results in Section 2.2.2 that any
estimator satisfying (2), i.e., any consistent estimator of $G_{n,\theta,\sigma}(t)$, necessarily also satisfies

$$ \sup_{\|\theta\| < R} P_{n,\theta,\sigma}\left( \left| \hat G_n(t) - G_{n,\theta,\sigma}(t) \right| > \delta \right) \xrightarrow{\,n\to\infty\,} 1 \tag{3} $$

for suitable positive constants $R$ and $\delta$ that do not depend on the estimator. That is, while the probability
in (2) converges to zero for every given $\theta$ by consistency, relation (3) shows that it does not do so uniformly
in $\theta$. It follows that $\hat G_n(t)$ can never be uniformly consistent (not even when restricting consideration to
uniform consistency over all compact subsets of the parameter space). Hence, a large sample size does not
guarantee a small estimation error with high probability when estimating the distribution function of a
post-model-selection estimator. In this sense, reliably assessing the precision of post-model-selection estimators
is an intrinsically hard problem. Apart from (3), we also provide minimax lower bounds for arbitrary (not
necessarily consistent) estimators of the unconditional distribution function $G_{n,\theta,\sigma}(t)$. For example, we provide
results that imply that

$$ \liminf_{n\to\infty}\; \inf_{\hat G_n(t)}\; \sup_{\|\theta\| < R} P_{n,\theta,\sigma}\left( \left| \hat G_n(t) - G_{n,\theta,\sigma}(t) \right| > \delta \right) > 0 \tag{4} $$

holds for suitable positive constants $R$ and $\delta$, where the infimum extends over all estimators of $G_{n,\theta,\sigma}(t)$.
The results in Section 2.2.2 in fact show that the balls $\|\theta\| < R$ in (3) and (4) can be replaced by suitable
balls (not necessarily centered at the origin) shrinking at the rate $n^{-1/2}$. This shows that the non-uniformity
phenomenon described in (3)-(4) is a local, rather than a global, phenomenon. In Section 2.2.2 we further
show that the non-uniformity phenomenon expressed in (3) and (4) typically also arises when the parameter
of interest is not $\chi$, but some other linear transformation of $\theta = (\chi', \psi')'$. As discussed in Remark 4.3, the
results also hold for randomized estimators of the unconditional distribution function $G_{n,\theta,\sigma}(t)$. Hence no
resampling procedure whatsoever can alleviate the problem. This explains the anecdotal evidence in the
literature that resampling methods are often unsuccessful in approximating distributional properties of
post-model-selection estimators (e.g., Dijkstra and Veldkamp (1988), or Freedman, Navidi, and Peters (1988)).
See also the discussion on resampling in Section 6.

The results outlined above are presented in Section 2.2 for the particular 'general-to-specific' model 
selection procedure described at the beginning of Section 2. Analogous results for a large class of model 
selection procedures, including Akaike's AIC and thresholding procedures, are then given in Section 3, based 
on the results in Section 2.2. In fact, the non-uniformity phenomenon expressed in (3)-(4) is not specific to 
the model selection procedures discussed in Sections 2 and 3 of the present paper, but will occur for most 
(if not all) model selection procedures, including consistent ones; cf. Sections 5 and 6 for more discussion. 
Section 5 also shows that the results are - as is to be expected - by no means limited to the linear regression 
model. 

We focus on the unconditional distributions of post-model-selection estimators in the present paper. One 
can, however, also envisage a situation where one is more interested in the conditional distribution given 
the outcome of the model selection procedure. In line with the literature on conditional inference (see, 
e.g., Robinson (1979) or Lehmann and Casella (1998, p. 421)), one may argue that, given the outcome 
of the model selection step, the relevant object of interest is the conditional rather than the unconditional 
distribution of the post-model-selection estimator. In this case similar results can be obtained and are
reported in Leeb and Pötscher (2006b). We note that on a technical level the results in Leeb and Pötscher
(2006b) and in the present paper require separate treatment.

The plan of the paper is as follows: Post-model-selection estimators based on a 'general-to-specific' model
selection procedure are the subject of Section 2. After introducing the basic framework and some notation,
like the family of models $M_p$ from which the 'general-to-specific' model selection procedure $\hat p$ selects, as
well as the post-model-selection estimator $\tilde\theta$, the unconditional cdf $G_{n,\theta,\sigma}(t)$ of (a linear function of) the
post-model-selection estimator is discussed in Section 2.1. Consistent estimators of $G_{n,\theta,\sigma}(t)$ are given in
Section 2.2.1. The main results of the paper are contained in Section 2.2.2 and Section 3: In Section 2.2.2
we provide a detailed analysis of the non-uniformity phenomenon encountered in (3)-(4). In Section 3 the
'impossibility' result from Section 2.2.2 is extended to a large class of model selection procedures including
Akaike's AIC and to selection from a non-nested collection of models. Some remarks are collected in Section
4, while Section 5 discusses extensions and the scope of the results of the paper. Conclusions are drawn
in Section 6. All proofs as well as some auxiliary results are collected into appendices. Finally a word on
notation: The Euclidean norm is denoted by $\|\cdot\|$, and $\lambda_{\max}(E)$ denotes the largest eigenvalue of a symmetric
matrix $E$. A prime denotes transposition of a matrix. For vectors $x$ and $y$ the relation $x \leq y$ ($x < y$,
respectively) denotes $x_i \leq y_i$ ($x_i < y_i$, respectively) for all $i$. As usual, $\Phi$ denotes the standard normal
distribution function.



2 Results for Post-Model-Selection Estimators Based on a 'General-to-Specific' Model Selection Procedure

Consider the linear regression model

$$ Y = X\theta + u, \tag{5} $$

where $X$ is a non-stochastic $n \times P$ matrix with $\operatorname{rank}(X) = P$ and $u \sim N(0, \sigma^2 I_n)$, $\sigma^2 > 0$. Here $n$ denotes
the sample size and we assume $n > P \geq 1$. In addition, we assume that $Q = \lim_{n\to\infty} X'X/n$ exists and is
non-singular. In this section we shall, similarly to Pötscher (1991), consider model selection from the
collection of nested models $M_{\mathcal{O}} \subseteq M_{\mathcal{O}+1} \subseteq \dots \subseteq M_P$, where $\mathcal{O}$ is specified by the user, and where for
$0 \leq p \leq P$ the model $M_p$ is given by

$$ M_p = \left\{ (\theta_1, \dots, \theta_P)' \in \mathbb{R}^P : \theta_{p+1} = \dots = \theta_P = 0 \right\}. $$

[In Section 3 below also general non-nested families of models will be considered.] Clearly, the model $M_p$
corresponds to the situation where only the first $p$ regressors in (5) are included. For the most parsimonious
model under consideration, i.e., for $M_{\mathcal{O}}$, we assume that $\mathcal{O}$ satisfies $0 \leq \mathcal{O} < P$; if $\mathcal{O} > 0$, this model contains
as free parameters only those components of the parameter vector $\theta$ that are not subject to model selection.
[In the notation used in connection with (1) we then have $\chi = (\theta_1, \dots, \theta_{\mathcal{O}})'$ and $\psi = (\theta_{\mathcal{O}+1}, \dots, \theta_P)'$.]
Furthermore, note that $M_0 = \{(0, \dots, 0)'\}$ and that $M_P = \mathbb{R}^P$. We call $M_p$ the regression model of order $p$.

The following notation will prove useful. For matrices $B$ and $C$ of the same row-dimension, the column-wise
concatenation of $B$ and $C$ is denoted by $(B : C)$. If $D$ is an $m \times P$ matrix, let $D[p]$ denote the $m \times p$
matrix consisting of the first $p$ columns of $D$. Similarly, let $D[\neg p]$ denote the $m \times (P - p)$ matrix consisting
of the last $P - p$ columns of $D$. If $x$ is a $P \times 1$ vector, we write in abuse of notation $x[p]$ and $x[\neg p]$ for $(x'[p])'$
and $(x'[\neg p])'$, respectively. [We shall use the above notation also in the 'boundary' cases $p = 0$ and $p = P$.
It will always be clear from the context how expressions containing symbols like $D[0]$, $D[\neg P]$, $x[0]$, or $x[\neg P]$
are to be interpreted.] As usual, the $i$-th component of a vector $x$ is denoted by $x_i$, and the entry in the $i$-th
row and $j$-th column of a matrix $B$ is denoted by $B_{i,j}$.

The restricted least-squares estimator of $\theta$ under the restriction $\theta[\neg p] = 0$, i.e., under $\theta_{p+1} = \dots = \theta_P = 0$,
will be denoted by $\tilde\theta(p)$, $0 \leq p \leq P$ (in case $p = P$ the restriction being void). Note that $\tilde\theta(p)$ is given by the
$P \times 1$ vector

$$ \tilde\theta(p) = \begin{pmatrix} (X[p]'X[p])^{-1} X[p]'Y \\ (0, \dots, 0)' \end{pmatrix} $$

where the expressions $\tilde\theta(0)$ and $\tilde\theta(P)$, respectively, are to be interpreted as the zero-vector in $\mathbb{R}^P$ and as the
unrestricted least-squares estimator of $\theta$. Given a parameter vector $\theta$ in $\mathbb{R}^P$, the order of $\theta$ (relative to the
nested sequence of models $M_p$) is defined as

$$ p_0(\theta) = \min\left\{ p : 0 \leq p \leq P,\ \theta \in M_p \right\}. $$

Hence, if $\theta$ is the true parameter vector, a model $M_p$ is a correct model if and only if $p \geq p_0(\theta)$. We stress
that $p_0(\theta)$ is a property of a single parameter, and hence needs to be distinguished from the notion of the
order of the model $M_p$ introduced earlier, which is a property of the set of parameters $M_p$.
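
As a small illustration of the definition of the order of a parameter vector, the following sketch computes it for a vector given as a Python list. The function name `order` and the tolerance argument `tol` are assumptions made for this example only; they do not appear in the paper.

```python
def order(theta, tol=0.0):
    """Order p_0(theta): the smallest p such that the components
    theta_{p+1}, ..., theta_P are all zero (up to the hypothetical tolerance tol)."""
    p = len(theta)
    # Strip trailing zeros; what remains is the order of theta.
    while p > 0 and abs(theta[p - 1]) <= tol:
        p -= 1
    return p


def is_correct_model(p, theta):
    """M_p is a correct model for theta if and only if p >= p_0(theta)."""
    return p >= order(theta)
```

Note that a zero in the middle of the vector does not reduce the order: only trailing zeros matter, because the models are nested along the ordering of the regressors.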
A model selection procedure is now nothing else than a data-driven (measurable) rule $\hat p$ that selects a
value from $\{\mathcal{O}, \dots, P\}$ and thus selects a model from the list of candidate models $M_{\mathcal{O}}, \dots, M_P$. In this section
we shall consider as an important leading case a 'general-to-specific' model selection procedure based on a
sequence of hypothesis tests. [Results for a larger class of model selection procedures, including Akaike's AIC,
are provided in Section 3.] This procedure is given as follows: The sequence of hypotheses $H_0^p : p_0(\theta) < p$ is
tested against the alternatives $H_1^p : p_0(\theta) = p$ in decreasing order starting at $p = P$. If, for some $p > \mathcal{O}$, $H_0^p$
is the first hypothesis in the process that is rejected, we set $\hat p = p$. If no rejection occurs until even $H_0^{\mathcal{O}+1}$
is not rejected, we set $\hat p = \mathcal{O}$. Each hypothesis in this sequence is tested by a kind of t-test where the error
variance is always estimated from the overall model (but see the discussion following Theorem 3.1 in Section
3 below for other choices of estimators of the error variance). More formally, we have

$$ \hat p = \max\left\{ p : |T_p| \geq c_p,\ \mathcal{O} \leq p \leq P \right\}, \tag{6} $$

with $c_{\mathcal{O}} = 0$ in order to ensure a well-defined $\hat p$ in the range $\{\mathcal{O}, \mathcal{O}+1, \dots, P\}$. For $\mathcal{O} < p \leq P$, the
critical values $c_p$ satisfy $0 < c_p < \infty$ and are independent of sample size (but see also Remark 4.2). The
test-statistics are given by

$$ T_p = \frac{\sqrt{n}\,\tilde\theta_p(p)}{\hat\sigma\,\xi_{n,p}} \qquad (0 < p \leq P) $$

with the convention that $T_0 = 0$. Furthermore,

$$ \xi_{n,p} = \left( \left[ \left( \frac{X[p]'X[p]}{n} \right)^{-1} \right]_{p,p} \right)^{1/2} \qquad (0 < p \leq P) $$

denotes the nonnegative square root of the $p$-th diagonal element of the matrix indicated, and $\hat\sigma$ is given by

$$ \hat\sigma^2 = (n - P)^{-1} (Y - X\tilde\theta(P))'(Y - X\tilde\theta(P)). $$

Note that under the hypothesis $H_0^p$ the statistic $T_p$ is t-distributed with $n - P$ degrees of freedom for
$0 < p \leq P$. It is also easy to see that the so-defined model selection procedure $\hat p$ is conservative: The probability
of selecting an incorrect model, i.e., the probability of the event $\{\hat p < p_0(\theta)\}$, converges to zero as the sample
size increases. In contrast, the probability of the event $\{\hat p = p\}$, for $p$ satisfying $\max\{p_0(\theta), \mathcal{O}\} \leq p \leq P$,
converges to a positive limit; cf., for example, Proposition 5.4 and equation (5.6) in Leeb (2006).

The post-model-selection estimator $\tilde\theta$ can now be defined as follows: On the event $\hat p = p$, $\tilde\theta$ is given by
the restricted least-squares estimator $\tilde\theta(p)$, i.e.,

$$ \tilde\theta = \sum_{p=\mathcal{O}}^{P} \tilde\theta(p)\, \mathbf{1}(\hat p = p), \tag{7} $$

where $\mathbf{1}(\cdot)$ denotes the indicator function of the event shown in the argument.
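
The 'general-to-specific' procedure and the resulting post-model-selection estimator can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `restricted_ls` and `general_to_specific`, and the convention of passing the critical values as a list `crit` indexed by `p`, are choices made for this example and are not taken from the paper.

```python
import numpy as np


def restricted_ls(X, Y, p):
    """Restricted least-squares estimator: regress Y on the first p columns
    of X and set the remaining P - p components to zero."""
    theta = np.zeros(X.shape[1])
    if p > 0:
        theta[:p] = np.linalg.lstsq(X[:, :p], Y, rcond=None)[0]
    return theta


def general_to_specific(X, Y, crit, O=0):
    """Return (p_hat, theta_tilde).

    crit[p] is the critical value c_p for O < p <= P (entries with index <= O
    are ignored, which mirrors the convention c_O = 0).
    """
    n, P = X.shape
    # Error variance is always estimated from the overall model of order P.
    resid = Y - X @ restricted_ls(X, Y, P)
    sigma_hat = np.sqrt(resid @ resid / (n - P))
    p_hat = O
    # Test downwards from p = P; the first rejection determines p_hat,
    # which is the same as taking the largest p with |T_p| >= c_p.
    for p in range(P, O, -1):
        Xp = X[:, :p]
        xi = np.sqrt(np.linalg.inv(Xp.T @ Xp / n)[p - 1, p - 1])
        T_p = np.sqrt(n) * restricted_ls(X, Y, p)[p - 1] / (sigma_hat * xi)
        if abs(T_p) >= crit[p]:
            p_hat = p
            break
    return p_hat, restricted_ls(X, Y, p_hat)
```

For instance, with `P = 3` and `O = 1` one might call `general_to_specific(X, Y, [0.0, 0.0, 1.96, 1.96], O=1)`; the first regressor is then always retained and only the last two are subject to selection.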
2.1 The Distribution of the Post-Model-Selection Estimator 

We now introduce the distribution function of a linear transformation of $\tilde\theta$ and summarize some of its
properties that will be needed in the subsequent development. To this end, let $A$ be a non-stochastic $k \times P$
matrix of rank $k$, $1 \leq k \leq P$, and consider the cdf

$$ G_{n,\theta,\sigma}(t) = P_{n,\theta,\sigma}\left( \sqrt{n}\, A(\tilde\theta - \theta) \leq t \right) \qquad (t \in \mathbb{R}^k). \tag{8} $$

Here $P_{n,\theta,\sigma}(\cdot)$ denotes the probability measure corresponding to a sample of size $n$ from (5).

Depending on the choice of the matrix $A$, several important scenarios are covered by (8): The cdf of
$\sqrt{n}(\tilde\theta - \theta)$ is obtained by setting $A$ equal to the $P \times P$ identity matrix $I_P$. In case $\mathcal{O} > 0$, the cdf of those
components of $\sqrt{n}(\tilde\theta - \theta)$ which correspond to the parameter of interest $\chi$ in (1) can be studied by setting $A$
to the $\mathcal{O} \times P$ matrix $(I_{\mathcal{O}} : 0)$ as we then have $A\theta = (\theta_1, \dots, \theta_{\mathcal{O}})' = \chi$. Finally, if $A \neq 0$ is a $1 \times P$ vector, we
obtain the distribution of a linear predictor based on the post-model-selection estimator. See the examples
at the end of Section 2.2.2 for more discussion.
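
To make the object in (8) concrete, the following sketch approximates the unconditional cdf by plain Monte Carlo simulation for the special choice $A = (I_{\mathcal{O}} : 0)$ with $\mathcal{O} \geq 1$. It is an illustration only and not a method proposed in the paper: it requires the true $\theta$ and $\sigma$ as inputs, which are unknown in practice. The function name `simulate_cdf` and all argument names are assumptions made for this example.

```python
import numpy as np


def simulate_cdf(X, theta, sigma, crit, t, O=1, reps=2000, seed=0):
    """Monte Carlo approximation of G_{n,theta,sigma}(t) for A = (I_O : 0), O >= 1.

    crit[p] is the critical value c_p for O < p <= P; t is a scalar or a
    length-O vector. theta and sigma are the (hypothetically known) true values.
    """
    rng = np.random.default_rng(seed)
    n, P = X.shape
    hits = 0
    for _ in range(reps):
        Y = X @ theta + sigma * rng.standard_normal(n)
        # Variance estimate from the overall model.
        resid = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
        sigma_hat = np.sqrt(resid @ resid / (n - P))
        # General-to-specific selection, testing downwards from p = P.
        p_hat = O
        for p in range(P, O, -1):
            Xp = X[:, :p]
            est = np.linalg.lstsq(Xp, Y, rcond=None)[0]
            xi = np.sqrt(np.linalg.inv(Xp.T @ Xp / n)[p - 1, p - 1])
            if abs(np.sqrt(n) * est[p - 1] / (sigma_hat * xi)) >= crit[p]:
                p_hat = p
                break
        # Post-model-selection estimator of the first O components.
        est = np.linalg.lstsq(X[:, :p_hat], Y, rcond=None)[0]
        err = np.sqrt(n) * (est[:O] - theta[:O])
        hits += int(np.all(err <= t))
    return hits / reps
```

Evaluating this function over a grid of `t` values for parameter vectors `theta` whose last components are of order $n^{-1/2}$ gives a feeling for how strongly the cdf depends on the unknown parameters.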

The cdf $G_{n,\theta,\sigma}$ and its properties have been analyzed in detail in Leeb and Pötscher (2003) and Leeb
(2006). To be able to access these results we need some further notation. Note that on the event $\hat p = p$ the
expression $A(\tilde\theta - \theta)$ equals $A(\tilde\theta(p) - \theta)$ in view of (7). The expected value of the restricted least-squares
estimator $\tilde\theta(p)$ will be denoted by $\eta_n(p)$ and is given by the $P \times 1$ vector

$$ \eta_n(p) = \begin{pmatrix} \theta[p] + (X[p]'X[p])^{-1} X[p]'X[\neg p]\,\theta[\neg p] \\ (0, \dots, 0)' \end{pmatrix} \tag{9} $$

with the conventions that $\eta_n(0) = (0, \dots, 0)' \in \mathbb{R}^P$ and that $\eta_n(P) = \theta$. Furthermore, let $\Phi_{n,p}$ denote the
cdf of $\sqrt{n} A(\tilde\theta(p) - \eta_n(p))$, i.e., the cdf of $\sqrt{n} A$ times the restricted least-squares estimator based on model
$M_p$ centered at its mean. Hence, $\Phi_{n,p}$ is the cdf of a $k$-variate Gaussian random vector with mean zero and
variance-covariance matrix $\sigma^2 A[p](X[p]'X[p]/n)^{-1}A[p]'$ in case $p > 0$, and it is the cdf of point-mass at zero
in $\mathbb{R}^k$ in case $p = 0$. If $p > 0$ and if the matrix $A[p]$ has full row rank $k$, then $\Phi_{n,p}$ has a density with respect
to Lebesgue measure, and we shall denote this density by $\phi_{n,p}$. We note that $\eta_n(p)$ depends on $\theta$ and that
$\Phi_{n,p}$ depends on $\sigma$ (in case $p > 0$), although these dependencies are not shown explicitly in the notation.
For $p > 0$ we introduce

$$ b_{n,p} = C_n^{(p)\prime} \left( A[p](X[p]'X[p]/n)^{-1}A[p]' \right)^{-}, \tag{10} $$

and

$$ \zeta_{n,p}^2 = \xi_{n,p}^2 - C_n^{(p)\prime} \left( A[p](X[p]'X[p]/n)^{-1}A[p]' \right)^{-} C_n^{(p)}, \tag{11} $$

with $\zeta_{n,p} \geq 0$. Here $C_n^{(p)} = A[p](X[p]'X[p]/n)^{-1}e_p$, where $e_p$ denotes the $p$-th standard basis vector in $\mathbb{R}^p$,
and $B^{-}$ denotes a generalized inverse of a matrix $B$. [Observe that $\zeta_{n,p}^2$ is invariant under the choice of the
generalized inverse. The same is not necessarily true for $b_{n,p}$, but is true for $b_{n,p}z$ for all $z$ in the
column-space of $A[p]$. Also note that (12) below depends on $b_{n,p}$ only through $b_{n,p}z$ with $z$ in the column-space of
$A[p]$.] We observe that the vector of covariances between $A\tilde\theta(p)$ and $\tilde\theta_p(p)$ is precisely given by $\sigma^2 n^{-1} C_n^{(p)}$
(and hence does not depend on $\theta$). Furthermore, observe that $A\tilde\theta(p)$ and $\tilde\theta_p(p)$ are uncorrelated if and only
if $\zeta_{n,p}^2 = \xi_{n,p}^2$ if and only if $b_{n,p}z = 0$ for all $z$ in the column-space of $A[p]$; cf. Lemma A.2 in Leeb (2005).

Finally, for a univariate Gaussian random variable $N$ with zero mean and variance $s^2$, $s \geq 0$, we write
$\Delta_s(a, b)$ for $P(|N - a| < b)$, $a \in \mathbb{R} \cup \{-\infty, \infty\}$, $b \in \mathbb{R}$. Note that $\Delta_s(\cdot, \cdot)$ is symmetric around zero in its first
argument, and that $\Delta_s(-\infty, b) = \Delta_s(\infty, b) = 0$ holds. In case $s = 0$, $N$ is to be interpreted as being equal
to zero, hence $a \mapsto \Delta_0(a, b)$ reduces to the indicator function of the interval $(-b, b)$.
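
The function $\Delta_s(a, b)$ is elementary to evaluate through the standard normal cdf, since $P(|N - a| < b) = \Phi((a+b)/s) - \Phi((a-b)/s)$ for $s > 0$ and $b > 0$. The sketch below uses only the Python standard library; the function names `norm_cdf` and `delta` are assumptions made for this example.

```python
import math


def norm_cdf(x):
    """Standard normal cdf, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def delta(s, a, b):
    """Delta_s(a, b) = P(|N - a| < b) for N ~ N(0, s^2) with s >= 0."""
    if b <= 0.0 or math.isinf(a):
        # Empty interval, or interval centre pushed to +/- infinity.
        return 0.0
    if s == 0.0:
        # N is identically zero: indicator of a lying in (-b, b).
        return 1.0 if abs(a) < b else 0.0
    return norm_cdf((a + b) / s) - norm_cdf((a - b) / s)
```

In formula (12) the first argument of $\Delta$ is a scaled mean of a restricted least-squares component and the second argument is a scaled critical value, so `delta` gives the probability that the corresponding test does not reject.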






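We are now in a position to present the explicit formula for $G_{n,\theta,\sigma}(t)$ derived in Leeb (2006):

$$
\begin{aligned}
G_{n,\theta,\sigma}(t) ={}& \Phi_{n,\mathcal{O}}\bigl(t - \sqrt{n} A(\eta_n(\mathcal{O}) - \theta)\bigr) \int_0^\infty \prod_{q=\mathcal{O}+1}^{P} \Delta_{\sigma\xi_{n,q}}\bigl(\sqrt{n}\,\eta_{n,q}(q),\, s c_q \sigma \xi_{n,q}\bigr)\, h(s)\,ds \\
&+ \sum_{p=\mathcal{O}+1}^{P} \int_{z \leq t - \sqrt{n} A(\eta_n(p) - \theta)} \Bigl[ \int_0^\infty \bigl(1 - \Delta_{\sigma\zeta_{n,p}}\bigl(\sqrt{n}\,\eta_{n,p}(p) + b_{n,p} z,\, s c_p \sigma \xi_{n,p}\bigr)\bigr) \\
&\qquad\qquad \times \prod_{q=p+1}^{P} \Delta_{\sigma\xi_{n,q}}\bigl(\sqrt{n}\,\eta_{n,q}(q),\, s c_q \sigma \xi_{n,q}\bigr)\, h(s)\,ds \Bigr] \Phi_{n,p}(dz).
\end{aligned}
\tag{12}
$$

In the above display, $\Phi_{n,p}(dz)$ denotes integration with respect to the measure induced by the normal cdf
$\Phi_{n,p}$ on $\mathbb{R}^k$ and $h$ denotes the density of $\hat\sigma/\sigma$, i.e., $h$ is the density of $(n - P)^{-1/2}$ times the square-root of a
chi-square distributed random variable with $n - P$ degrees of freedom. The finite-sample distribution of the
post-model-selection estimator given in (12) is in general not normal, e.g., it can be bimodal; see Figure 2 in
Leeb and Pötscher (2005a) or Figure 1 in Leeb (2006). [An exception where (12) is normal is the somewhat
trivial case where $C_n^{(p)} = 0$, i.e., where $A\tilde\theta(p)$ and $\tilde\theta_p(p)$ are uncorrelated, for $p = \mathcal{O} + 1, \dots, P$; see Leeb
(2006, Section 3.3) for more discussion.] We note for later use that $G_{n,\theta,\sigma}(t) = \sum_{p=\mathcal{O}}^{P} G_{n,\theta,\sigma}(t \mid p)\, \pi_{n,\theta,\sigma}(p)$
where $G_{n,\theta,\sigma}(t \mid p)$ represents the cdf of $\sqrt{n} A(\tilde\theta - \theta)$ conditional on the event $\{\hat p = p\}$ and where $\pi_{n,\theta,\sigma}(p)$ is
the probability of this event under $P_{n,\theta,\sigma}$. Note that $\pi_{n,\theta,\sigma}(p)$ is always positive for $\mathcal{O} \leq p \leq P$; cf. Leeb
(2006), Section 3.2.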

To describe the large-sample limit of $G_{n,\theta,\sigma}$, some further notation is necessary. For $p$ satisfying $0 < p < P$,
partition the matrix $Q = \lim_{n\to\infty} X'X/n$ as

$$ Q = \begin{pmatrix} Q[p : p] & Q[p : \neg p] \\ Q[\neg p : p] & Q[\neg p : \neg p] \end{pmatrix} $$

where $Q[p : p]$ is a $p \times p$ matrix. Let $\Phi_{\infty,p}$ be the cdf of a $k$-variate Gaussian random vector with mean zero
and variance-covariance matrix $\sigma^2 A[p]Q[p : p]^{-1}A[p]'$, $0 < p \leq P$, and let $\Phi_{\infty,0}$ denote the cdf of point-mass
at zero in $\mathbb{R}^k$. Note that $\Phi_{\infty,p}$ has a Lebesgue density if $p > 0$ and the matrix $A[p]$ has full row rank $k$; in
this case, we denote the Lebesgue density of $\Phi_{\infty,p}$ by $\phi_{\infty,p}$. Finally, for $p = 1, \dots, P$, define

$$
\begin{aligned}
\xi_{\infty,p}^2 &= \left( Q[p : p]^{-1} \right)_{p,p}, \\
\zeta_{\infty,p}^2 &= \xi_{\infty,p}^2 - C_\infty^{(p)\prime} \left( A[p]Q[p : p]^{-1}A[p]' \right)^{-} C_\infty^{(p)}, \\
b_{\infty,p} &= C_\infty^{(p)\prime} \left( A[p]Q[p : p]^{-1}A[p]' \right)^{-},
\end{aligned}
\tag{13}
$$

where $C_\infty^{(p)} = A[p]Q[p : p]^{-1}e_p$, with $e_p$ denoting the $p$-th standard basis vector in $\mathbb{R}^p$; furthermore, take $\zeta_{\infty,p}$
and $\xi_{\infty,p}$ as the nonnegative square roots of $\zeta_{\infty,p}^2$ and $\xi_{\infty,p}^2$, respectively. As the notation suggests, $\Phi_{\infty,p}$ is
the large-sample limit of $\Phi_{n,p}$, and $C_\infty^{(p)}$, $\xi_{\infty,p}$, and $\zeta_{\infty,p}$ are the limits of $C_n^{(p)}$, $\xi_{n,p}$, and $\zeta_{n,p}$, respectively;
moreover, $b_{n,p}z$ converges to $b_{\infty,p}z$ for each $z$ in the column-space of $A[p]$. See Lemma A.2 in Leeb (2005).
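
The limiting quantities in (13) depend only on $Q$, $A$, and $p$, so they can be computed directly. The sketch below does this with NumPy, using the Moore-Penrose pseudoinverse as one particular choice of generalized inverse; that choice, the function name `limit_quantities`, and the return convention are assumptions made for this example.

```python
import numpy as np


def limit_quantities(Q, A, p):
    """Return (xi2, zeta2, b) as in (13) for 1 <= p <= P.

    Q is the P x P limit of X'X/n, A is the k x P matrix of interest.
    """
    Qpp_inv = np.linalg.inv(Q[:p, :p])          # Q[p : p]^{-1}
    Ap = A[:, :p]                               # A[p], first p columns of A
    e_p = np.zeros(p)
    e_p[p - 1] = 1.0                            # p-th standard basis vector
    C = Ap @ Qpp_inv @ e_p                      # C_infty^{(p)}, a k-vector
    G = np.linalg.pinv(Ap @ Qpp_inv @ Ap.T)     # one generalized inverse
    xi2 = Qpp_inv[p - 1, p - 1]
    b = C @ G
    zeta2 = xi2 - C @ G @ C
    return xi2, zeta2, b
```

For example, with two regressors, $A = (1 : 0)$, and a limiting correlation of $0.5$ between the regressors, the call `limit_quantities(np.array([[1.0, 0.5], [0.5, 1.0]]), np.array([[1.0, 0.0]]), 2)` yields a non-zero `b`, which corresponds to the asymptotically correlated case.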

The next result describes the large-sample limit of the cdf under local alternatives to $\theta$ and is taken from
Leeb (2006, Corollary 5.6). Recall that the total variation distance between two cdfs $G$ and $G^*$ on $\mathbb{R}^k$ is
defined as $\|G - G^*\|_{TV} = \sup_E |G(E) - G^*(E)|$, where the supremum is taken over all Borel sets $E$. Clearly,
the relation $|G(t) - G^*(t)| \leq \|G - G^*\|_{TV}$ holds for all $t \in \mathbb{R}^k$. Thus, if $G$ and $G^*$ are close with respect to
the total variation distance, then $G(t)$ is close to $G^*(t)$, uniformly in $t$.
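
The relation between the total variation distance and the pointwise distance of cdfs can be checked numerically in a simplified setting. The sketch below treats two discrete distributions on a common finite, increasingly ordered support on the real line, where the total variation distance equals half the sum of absolute differences of the probability masses. This discrete univariate setting and the function names are assumptions made for illustration; the paper works with cdfs on $\mathbb{R}^k$.

```python
def tv_distance(p, q):
    """Total variation distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def max_cdf_gap(p, q):
    """sup_t |G(t) - G*(t)| when the support points are listed in increasing order."""
    gap, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cp += pi          # running cdf of the first distribution
        cq += qi          # running cdf of the second distribution
        gap = max(gap, abs(cp - cq))
    return gap
```

With `p = [0.2, 0.5, 0.3]` and `q = [0.3, 0.3, 0.4]`, the largest cdf gap is smaller than the total variation distance, in line with the inequality above.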
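Proposition 2.1 Suppose $\theta \in \mathbb{R}^P$ and $\gamma \in \mathbb{R}^P$ and let $\sigma^{(n)}$ be a sequence of positive real numbers which
converges to a (finite) limit $\sigma > 0$ as $n \to \infty$. Then the cdf $G_{n,\theta+\gamma/\sqrt{n},\sigma^{(n)}}$ converges to a limit $G_{\infty,\theta,\sigma,\gamma}$ in
total variation, i.e.,

$$ \left\| G_{n,\theta+\gamma/\sqrt{n},\sigma^{(n)}} - G_{\infty,\theta,\sigma,\gamma} \right\|_{TV} \xrightarrow{\,n\to\infty\,} 0. \tag{14} $$

The large-sample limit cdf $G_{\infty,\theta,\sigma,\gamma}(t)$ is given by

$$
\begin{aligned}
&\Phi_{\infty,p_*}\bigl(t - \beta^{(p_*)}\bigr) \prod_{q=p_*+1}^{P} \Delta_{\sigma\xi_{\infty,q}}\bigl(\nu_q,\, c_q \sigma \xi_{\infty,q}\bigr) \\
&\quad + \sum_{p=p_*+1}^{P} \int_{z \leq t - \beta^{(p)}} \bigl(1 - \Delta_{\sigma\zeta_{\infty,p}}\bigl(\nu_p + b_{\infty,p} z,\, c_p \sigma \xi_{\infty,p}\bigr)\bigr)\, \Phi_{\infty,p}(dz) \prod_{q=p+1}^{P} \Delta_{\sigma\xi_{\infty,q}}\bigl(\nu_q,\, c_q \sigma \xi_{\infty,q}\bigr)
\end{aligned}
\tag{15}
$$

where $p_* = \max\{p_0(\theta), \mathcal{O}\}$. Here for $0 \leq p \leq P$

$$ \beta^{(p)} = A \begin{pmatrix} Q[p : p]^{-1} Q[p : \neg p]\, \gamma[\neg p] \\ -\gamma[\neg p] \end{pmatrix} $$

with the convention that $\beta^{(p)} = -A\gamma$ if $p = 0$ and that $\beta^{(p)} = (0, \dots, 0)'$ if $p = P$. Furthermore, we have
set $\nu_p = \gamma_p + \left( Q[p : p]^{-1} Q[p : \neg p]\, \gamma[\neg p] \right)_p$ for $p > 0$. [Note that $\beta^{(p)} = \lim_{n\to\infty} \sqrt{n} A(\eta_n(p) - \theta - \gamma/\sqrt{n})$
for $p \geq p_0(\theta)$, and that $\nu_p = \lim_{n\to\infty} \sqrt{n}\, \eta_{n,p}(p)$ for $p > p_0(\theta)$. Here $\eta_n(p)$ is defined as in (9), but with
$\theta + \gamma/\sqrt{n}$ replacing $\theta$.]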

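If $p_* > 0$ and if the matrix $A[p_*]$ has full row rank $k$, then the Lebesgue density $\phi_{\infty,p}$ of $\Phi_{\infty,p}$ exists for
all $p \geq p_*$ and hence the density of (15) exists and is given by

$$
\begin{aligned}
&\phi_{\infty,p_*}\bigl(t - \beta^{(p_*)}\bigr) \prod_{q=p_*+1}^{P} \Delta_{\sigma\xi_{\infty,q}}\bigl(\nu_q,\, c_q \sigma \xi_{\infty,q}\bigr) \\
&\quad + \sum_{p=p_*+1}^{P} \bigl(1 - \Delta_{\sigma\zeta_{\infty,p}}\bigl(\nu_p + b_{\infty,p}(t - \beta^{(p)}),\, c_p \sigma \xi_{\infty,p}\bigr)\bigr)\, \phi_{\infty,p}\bigl(t - \beta^{(p)}\bigr) \prod_{q=p+1}^{P} \Delta_{\sigma\xi_{\infty,q}}\bigl(\nu_q,\, c_q \sigma \xi_{\infty,q}\bigr).
\end{aligned}
$$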

Like the finite-sample distribution, the limiting distribution of the post-model-selection estimator given
in (15) is in general not normal. An exception is the case where $C_\infty^{(p)} = 0$ for $p > p_*$, in which case $G_{\infty,\theta,\sigma,\gamma}$
reduces to $\Phi_{\infty,P}$; see Remark A.6 in Appendix A. If $\gamma = 0$, we write $G_{\infty,\theta,\sigma}(t)$ as shorthand for $G_{\infty,\theta,\sigma,0}(t)$
in the following.

2.2 Estimators of the Finite-Sample Distribution 

For the purpose of inference after model selection the finite-sample distribution of the post-model-selection
estimator is an object of particular interest. As we have seen, it depends on unknown parameters in a
complicated manner, and hence one will have to be satisfied with estimators of this cdf. As we shall see,
it is not difficult to construct consistent estimators of $G_{n,\theta,\sigma}(t)$. However, despite this consistency result,
we shall find in Section 2.2.2 that any estimator of $G_{n,\theta,\sigma}(t)$ typically performs unsatisfactorily, in that the
estimation error cannot become small uniformly over (subsets of) the parameter space even as sample size
goes to infinity. In particular, no uniformly consistent estimators exist, not even locally.
2.2.1 Consistent Estimators

We construct a consistent estimator of $G_{n,\theta,\sigma}(t)$ by commencing from the asymptotic distribution.
Specializing to the case $\gamma = 0$ and $\sigma^{(n)} = \sigma$ in Proposition 2.1, the large-sample limit of $G_{n,\theta,\sigma}(t)$ is given
by

$$
\begin{aligned}
G_{\infty,\theta,\sigma}(t) ={}& \Phi_{\infty,p_*}(t) \prod_{q=p_*+1}^{P} \Delta_{\sigma\xi_{\infty,q}}\bigl(0,\, c_q \sigma \xi_{\infty,q}\bigr) \\
&+ \sum_{p=p_*+1}^{P} \int_{z \leq t} \bigl(1 - \Delta_{\sigma\zeta_{\infty,p}}\bigl(b_{\infty,p} z,\, c_p \sigma \xi_{\infty,p}\bigr)\bigr)\, \Phi_{\infty,p}(dz) \prod_{q=p+1}^{P} \Delta_{\sigma\xi_{\infty,q}}\bigl(0,\, c_q \sigma \xi_{\infty,q}\bigr)
\end{aligned}
\tag{16}
$$

with $p_* = \max\{p_0(\theta), \mathcal{O}\}$. Note that $G_{\infty,\theta,\sigma}(t)$ depends on $\theta$ only through $p_*$. Let $\hat\Phi_{n,p}$ denote the cdf of a
$k$-variate Gaussian random vector with mean zero and variance-covariance matrix $\hat\sigma^2 A[p](X[p]'X[p]/n)^{-1}A[p]'$,
$0 < p \leq P$; we also adopt the convention that $\hat\Phi_{n,0}$ denotes the cdf of point-mass at zero in $\mathbb{R}^k$. [We use the
same convention for $\hat\Phi_{n,p}$ in case $\hat\sigma = 0$, which is a probability zero event.] An estimator $\check G_n(t)$ of $G_{n,\theta,\sigma}(t)$
is now defined as follows: We first employ an auxiliary procedure $\bar p$ that consistently estimates $p_0(\theta)$ (e.g.,
$\bar p$ could be obtained from BIC or from a 'general-to-specific' hypothesis testing procedure employing critical
values that go to infinity but are $o(n^{1/2})$ as $n \to \infty$). The estimator $\check G_n(t)$ is now given by the expression
in (16) but with $p_*$, $\sigma$, $b_{\infty,p}$, $\zeta_{\infty,p}$, $\xi_{\infty,p}$, and $\Phi_{\infty,p}$ replaced by $\max\{\bar p, \mathcal{O}\}$, $\hat\sigma$, $b_{n,p}$, $\zeta_{n,p}$, $\xi_{n,p}$, and $\hat\Phi_{n,p}$,
respectively. A little reflection shows that $\check G_n$ is again a cdf. We have the following consistency results.

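Proposition 2.2 The estimator $\check G_n$ is consistent (in the total variation distance) for $G_{n,\theta,\sigma}$ and $G_{\infty,\theta,\sigma}$.
That is, for every $\delta > 0$

$$ P_{n,\theta,\sigma}\left( \left\| \check G_n(\cdot) - G_{n,\theta,\sigma}(\cdot) \right\|_{TV} > \delta \right) \xrightarrow{\,n\to\infty\,} 0, \tag{17} $$

$$ P_{n,\theta,\sigma}\left( \left\| \check G_n(\cdot) - G_{\infty,\theta,\sigma}(\cdot) \right\|_{TV} > \delta \right) \xrightarrow{\,n\to\infty\,} 0 \tag{18} $$

for all $\theta \in \mathbb{R}^P$ and all $\sigma > 0$.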

While the estimator constructed above on the basis of the formula for $G_{\infty,\theta,\sigma}$ is consistent, it can be
expected to perform poorly in finite samples since convergence of $G_{n,\theta,\sigma}$ to $G_{\infty,\theta,\sigma}$ is typically not uniform
in $\theta$ (cf. Appendix B), and since in case the true $\theta$ is 'close' to $M_{p_0(\theta)-1}$ the auxiliary decision procedure $\bar p$
(although being consistent for $p_0(\theta)$) will then have difficulties making the correct decision in finite samples.
In the next section we show that this poor performance is not particular to the estimator $\check G_n$ constructed
above, but is a genuine feature of the estimation problem under consideration.

2.2.2 Performance Limits and Impossibility Results 

We now provide lower bounds for the performance of estimators of the cdf $G_{n,\theta,\sigma}(t)$ of the post-model-selection
estimator $A\tilde\theta$; that is, we give lower bounds on the worst-case probability that the estimation
error exceeds a certain threshold. These lower bounds are large, being 1 or 1/2, depending on the situation
considered; furthermore, they remain lower bounds even if one restricts attention only to certain subsets of
the parameter space that shrink at the rate $n^{-1/2}$. In this sense the 'impossibility' results are of a local
nature. In particular, the lower bounds imply that no uniformly consistent estimator of the cdf $G_{n,\theta,\sigma}(t)$
exists, not even locally.
In the following, the asymptotic 'correlation' between $A\tilde\theta(p)$ and $\tilde\theta_p(p)$ as measured by
$C_\infty^{(p)} = \lim_{n\to\infty} C_n^{(p)}$ will play an important role. [Recall that $\tilde\theta(p)$ denotes the least-squares estimator of $\theta$ based
on model $M_p$ and that $A\theta$ is the parameter vector of interest. Furthermore, the vector of covariances between
$A\tilde\theta(p)$ and $\tilde\theta_p(p)$ is given by $\sigma^2 n^{-1} C_n^{(p)}$ with $C_n^{(p)} = A[p](X[p]'X[p]/n)^{-1} e_p$.] Note that $C_\infty^{(p)}$ equals
$A[p]\,Q[p:p]^{-1} e_p$, and hence does not depend on the unknown parameters $\theta$ or $\sigma$. In the important special
case discussed in the Introduction, cf. (1), the matrix $A$ equals the $\mathcal{O} \times P$ matrix $(I_{\mathcal{O}} : 0)$, and the condition
$C_\infty^{(p)} \neq 0$ reduces to the condition that the regressor corresponding to the $p$-th column of $(V : W)$ is asymptotically
correlated with at least one of the regressors corresponding to the columns of $V$. See Example 1
below for more discussion.
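
To make the quantity $C_n^{(p)}$ concrete, the following minimal sketch evaluates $A[p](X[p]'X[p]/n)^{-1}e_p$ numerically. It assumes that $X[p]$ and $A[p]$ denote the first $p$ columns of $X$ and $A$; the helper `design_with_gram`, which builds a design whose Gram matrix $X'X/n$ equals a prescribed matrix, is hypothetical and used only for illustration.

```python
import numpy as np

def c_n(X, A, p):
    """C_n^(p) = A[p] (X[p]'X[p]/n)^{-1} e_p, reading [p] as 'first p columns' (assumption)."""
    n = X.shape[0]
    Xp, Ap = X[:, :p], A[:, :p]
    e_p = np.zeros(p)
    e_p[p - 1] = 1.0                      # p-th standard basis vector of R^p
    return Ap @ np.linalg.solve(Xp.T @ Xp / n, e_p)

def design_with_gram(n, gram, seed=0):
    """Hypothetical helper: an n x P design X with X'X/n equal to `gram`."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, gram.shape[0])))  # orthonormal columns
    return np.sqrt(n) * q @ np.linalg.cholesky(gram).T

A = np.array([[1.0, 0.0]])                # A = (I_1 : 0), i.e. k = 1 and P = 2
X_orth = design_with_gram(50, np.eye(2))
X_corr = design_with_gram(50, np.array([[1.0, 0.5], [0.5, 1.0]]))

c_orth = c_n(X_orth, A, 2)                # orthogonal regressors
c_corr = c_n(X_corr, A, 2)                # correlated regressors
print(c_orth, c_corr)
```

With orthogonal regressors the value is zero up to rounding, which is the 'uncorrelatedness' case; with correlated regressors it is non-zero, which is the case in which the correlation condition $C^{(p)} \neq 0$ holds.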

In the result to follow we shall consider performance limits for estimators of $G_{n,\theta,\sigma}(t)$ at a fixed value of the
argument $t$. An estimator of $G_{n,\theta,\sigma}(t)$ is now nothing else than a real-valued random variable $T_n = T_n(Y, X)$.
For mnemonic reasons we shall, however, use the symbol $\hat G_n(t)$ instead of $T_n$ to denote an arbitrary estimator
of $G_{n,\theta,\sigma}(t)$. This notation should not be taken as implying that the estimator is obtained by evaluating
an estimated cdf at the argument $t$, or that it is a priori constrained to lie between zero and one. We shall
use this notational convention mutatis mutandis also in subsequent sections. Regarding the non-uniformity
phenomenon, we then have a dichotomy which is described in the following two results.

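Theorem 2.3 Suppose that $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are asymptotically correlated, i.e., $C_\infty^{(q)} \neq 0$, for some $q$ satisfying
$\mathcal{O} < q \le P$, and let $q^*$ denote the largest $q$ with this property. Then the following holds for every
$\theta \in M_{q^*-1}$, every $\sigma$, $0 < \sigma < \infty$, and every $t \in \mathbb{R}^k$: There exist $\delta_0 > 0$ and $\rho_0$, $0 < \rho_0 < \infty$, such that any
estimator $\hat G_n(t)$ of $G_{n,\theta,\sigma}(t)$ satisfying

$$P_{n,\theta,\sigma}\bigl(|\hat G_n(t) - G_{n,\theta,\sigma}(t)| > \delta\bigr) \xrightarrow{n\to\infty} 0 \tag{19}$$

for each $\delta > 0$ (in particular, every estimator that is consistent) also satisfies

$$\sup_{\substack{\vartheta \in M_{q^*} \\ \|\vartheta-\theta\| < \rho_0/\sqrt{n}}} P_{n,\vartheta,\sigma}\bigl(|\hat G_n(t) - G_{n,\vartheta,\sigma}(t)| > \delta_0\bigr) \xrightarrow{n\to\infty} 1. \tag{20}$$

The constants $\delta_0$ and $\rho_0$ may be chosen in such a way that they depend only on $t$, $Q$, $A$, $\sigma$, and the critical
values $c_p$ for $\mathcal{O} < p \le P$. Moreover,

$$\liminf_{n\to\infty} \inf_{\hat G_n(t)} \sup_{\substack{\vartheta \in M_{q^*} \\ \|\vartheta-\theta\| < \rho_0/\sqrt{n}}} P_{n,\vartheta,\sigma}\bigl(|\hat G_n(t) - G_{n,\vartheta,\sigma}(t)| > \delta_0\bigr) > 0 \tag{21}$$

and

$$\sup_{\delta>0} \liminf_{n\to\infty} \inf_{\hat G_n(t)} \sup_{\substack{\vartheta \in M_{q^*} \\ \|\vartheta-\theta\| < \rho_0/\sqrt{n}}} P_{n,\vartheta,\sigma}\bigl(|\hat G_n(t) - G_{n,\vartheta,\sigma}(t)| > \delta\bigr) \ge \frac{1}{2}, \tag{22}$$

where the infima in (21) and (22) extend over all estimators $\hat G_n(t)$ of $G_{n,\theta,\sigma}(t)$.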

Remark 2.4 Assume that the conditions of the preceding theorem are satisfied. Suppose further that $p^{\odot}$,
$\mathcal{O} \le p^{\odot} < q^*$, is such that either $p^{\odot} > 0$ and some row of $A[p^{\odot}]$ equals zero, or such that $p^{\odot} = 0$. Then
there exist $\delta_0 > 0$ and $0 < \rho_0 < \infty$ such that the left-hand side of (21) is not less than 1/2 for each $\theta \in M_{p^{\odot}}$.

Theorem 2.3 a fortiori implies a corresponding 'impossibility' result for estimation of the function $G_{n,\theta,\sigma}(\cdot)$
when the estimation error is measured in the total variation distance or the sup-norm; cf. also Section 5.

It remains to consider the - quite exceptional - case where the assumption of Theorem 2.3 is not satisfied,
i.e., where $C_\infty^{(q)} = 0$ for all $q$ in the range $\mathcal{O} < q \le P$. Under this 'uncorrelatedness' condition it is indeed
possible to construct an estimator of $G_{n,\theta,\sigma}$ which is uniformly consistent: It is not difficult to see that the
asymptotic distribution of $G_{n,\theta,\sigma}$ reduces to $\Phi_{\infty,P}$ under this 'uncorrelatedness' condition. Furthermore, the
second half of Proposition B.1 in Appendix B shows that then the convergence of $G_{n,\theta,\sigma}$ to its large-sample
limit is uniform w.r.t. $\theta$, suggesting $\hat\Phi_{n,P}$, an estimated version of $\Phi_{\infty,P}$, as an estimator for $G_{n,\theta,\sigma}$.

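Proposition 2.5 Suppose that $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are asymptotically uncorrelated, i.e., $C_\infty^{(q)} = 0$, for all $q$
satisfying $\mathcal{O} < q \le P$. Then

$$\sup_{\theta \in \mathbb{R}^P} \; \sup_{\sigma_* \le \sigma \le \sigma^*} P_{n,\theta,\sigma}\bigl(\|\hat\Phi_{n,P} - G_{n,\theta,\sigma}\|_{TV} > \delta\bigr) \xrightarrow{n\to\infty} 0 \tag{23}$$

holds for each $\delta > 0$, and for any constants $\sigma_*$ and $\sigma^*$ satisfying $0 < \sigma_* \le \sigma^* < \infty$.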

Inspection of the proof of Proposition 2.5 shows that (23) continues to hold if the estimator $\hat\Phi_{n,P}$ is
replaced by any of the estimators $\hat\Phi_{n,p}$ for $\mathcal{O} \le p \le P$. We also note that in case $\mathcal{O} = 0$ the assumption
of Proposition 2.5 is never satisfied in view of Proposition 4.4 in Leeb and Potscher (2006b), and hence
Theorem 2.3 always applies in that case. Another consequence of Proposition 4.4 in Leeb and Potscher
(2006b) is that - under the 'uncorrelatedness' assumption of Proposition 2.5 - the restricted least-squares
estimators $A\tilde\theta(q)$ for $q \ge \mathcal{O}$ perform asymptotically as well as the unrestricted estimator $A\tilde\theta(P)$; this clearly
shows that the case covered by Proposition 2.5 is highly exceptional.

In summary we see that it is typically impossible to construct an estimator of $G_{n,\theta,\sigma}(t)$ which performs
reasonably well even asymptotically. Whenever Theorem 2.3 applies, any estimator of $G_{n,\theta,\sigma}(t)$ suffers
from a non-uniformity defect which is caused by parameters belonging to shrinking 'tubes' surrounding
$M_{q^*-1}$. For the sake of completeness, we remark that outside a 'tube' of fixed positive radius that surrounds
$M_{q^*-1}$ the non-uniformity need not be present: Let $q^*$ be as in Theorem 2.3 and define the set $U$ as
$U = \{\theta \in \mathbb{R}^P : |\theta_{q^*}| > r\}$ for some fixed $r > 0$. Then $\hat\Phi_{n,P}(t)$ is an estimator of $G_{n,\theta,\sigma}(t)$ that is uniformly
consistent over $\theta \in U$; more generally, it can be shown that then the relation (23) holds if the supremum
over $\theta$ on the left-hand side is restricted to $\theta \in U$.

We conclude this section by illustrating the above results with some important examples. 

Example 1: (The distribution of $\tilde\chi$) Consider the model given in (1) with $\chi$ representing the parameter
of interest. Using the general notation of Section 2, this corresponds to the case $A\theta = (\theta_1, \ldots, \theta_{\mathcal{O}})' = \chi$
with $A$ representing the $\mathcal{O} \times P$ matrix $(I_{\mathcal{O}} : 0)$. Here $k = \mathcal{O} > 0$. The cdf $G_{n,\theta,\sigma}$ then represents the cdf
of $\sqrt{n}(\tilde\chi - \chi)$. Assume first that $\lim_{n\to\infty} V'W/n \neq 0$. Then $C_\infty^{(q)} \neq 0$ holds for some $q > \mathcal{O}$. Consequently,
the 'impossibility' results for the estimation of $G_{n,\theta,\sigma}$ given in Theorem 2.3 always apply. Next assume
that $\lim_{n\to\infty} V'W/n = 0$. Then $C_\infty^{(q)} = 0$ for every $q > \mathcal{O}$. In this case Proposition 2.5 applies and a
uniformly consistent estimator of $G_{n,\theta,\sigma}$ indeed exists. Summarizing we note that any estimator of $G_{n,\theta,\sigma}$
suffers from the non-uniformity phenomenon except in the special case where the columns of $V$ and $W$
are asymptotically orthogonal in the sense that $\lim_{n\to\infty} V'W/n = 0$. But this is precisely the situation
where inclusion or exclusion of the regressors in $W$ has no effect on the distribution of the estimator $\tilde\chi$
asymptotically; hence it is not surprising that also the model selection procedure does not have an effect on
the estimation of the cdf of the post-model-selection estimator $\tilde\chi$. This observation may tempt one to enforce
orthogonality between the columns of $V$ and $W$ by either replacing the columns of $V$ by their residuals from
the projection on the column space of $W$ or vice versa. However, this is not helpful for the following reasons:
In the first case one then in fact avoids model selection as all the restricted least-squares estimators for $\chi$
under consideration (and hence also the post-model-selection estimator $\tilde\chi$) in the reparameterized model
coincide with the unrestricted least-squares estimator. In the second case the coefficients of the columns of
$V$ in the reparameterized model no longer coincide with the parameter of interest $\chi$ (and again are estimated
by one and the same estimator regardless of inclusion/exclusion of columns of the transformed $W$-matrix).

Example 2: (The distribution of $\tilde\theta$) For $A$ equal to $I_P$, the cdf $G_{n,\theta,\sigma}$ is the cdf of $\sqrt{n}(\tilde\theta - \theta)$. Here,
$A\tilde\theta(q)$ reduces to $\tilde\theta(q)$, and hence $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are perfectly correlated for every $q > \mathcal{O}$. Consequently,
the 'impossibility' result for estimation of $G_{n,\theta,\sigma}$ given in Theorem 2.3 applies. [In fact, the slightly stronger
result mentioned in Remark 2.4 always applies here.] We therefore see that estimation of the distribution of
the post-model-selection estimator of the entire parameter vector is always plagued by the non-uniformity
phenomenon.

Example 3: (The distribution of a linear predictor) Suppose $A \neq 0$ is a $1 \times P$ vector and one is interested
in estimating the cdf $G_{n,\theta,\sigma}$ of the linear predictor $A\tilde\theta$. Then Theorem 2.3 and the discussion following
Proposition 2.5 show that the non-uniformity phenomenon always arises in this estimation problem in case
$\mathcal{O} = 0$. In case $\mathcal{O} > 0$, the non-uniformity problem is generically also present, except in the degenerate case
where $C_\infty^{(q)} = 0$ for all $q$ satisfying $\mathcal{O} < q \le P$ (in which case Proposition 4.4 in Leeb and Potscher (2006b)
shows that the least-squares predictors from all models $M_p$, $\mathcal{O} \le p \le P$, perform asymptotically equally
well).

3 Extensions to Other Model Selection Procedures Including AIC 

In this section we show that the 'impossibility' result obtained in the previous section for a 'general-to-specific'
model selection procedure carries over to a large class of model selection procedures, including
Akaike's widely used AIC. Again consider the linear regression model (5) with the same assumptions on the
regressors and the errors as in Section 2. Let $\{0,1\}^P$ denote the set of all 0-1 sequences of length $P$. For
each $\mathfrak{r} \in \{0,1\}^P$ let $M_{\mathfrak{r}}$ denote the set $\{\theta \in \mathbb{R}^P : \theta_i(1 - \mathfrak{r}_i) = 0 \text{ for } 1 \le i \le P\}$ where $\mathfrak{r}_i$ represents the $i$-th
component of $\mathfrak{r}$. I.e., $M_{\mathfrak{r}}$ describes a linear submodel with those parameters $\theta_i$ restricted to zero for which
$\mathfrak{r}_i = 0$. Now let $\mathfrak{R}$ be a user-supplied subset of $\{0,1\}^P$. We consider model selection procedures that select
from the set $\mathfrak{R}$, or equivalently from the set of models $\{M_{\mathfrak{r}} : \mathfrak{r} \in \mathfrak{R}\}$. Note that there is now no assumption
that the candidate models are nested (for example, if $\mathfrak{R} = \{0,1\}^P$ all possible submodels are candidates for
selection). Also cases where the inclusion of a subset of regressors is undisputed on a priori grounds are
obviously covered by this framework upon suitable choice of $\mathfrak{R}$.

We shall assume throughout this section that $\mathfrak{R}$ contains $\mathfrak{r}_{full} = (1, \ldots, 1)$ and also at least one element $\mathfrak{r}_*$
satisfying $|\mathfrak{r}_*| = P - 1$, where $|\mathfrak{r}_*|$ represents the number of non-zero coordinates of $\mathfrak{r}_*$. Let $\hat{\mathfrak{r}}$ be an arbitrary
model selection procedure, i.e., $\hat{\mathfrak{r}} = \hat{\mathfrak{r}}(Y, X)$ is a random variable taking its values in $\mathfrak{R}$. We furthermore
assume throughout this section that the model selection procedure $\hat{\mathfrak{r}}$ satisfies the following mild condition:
For every $\mathfrak{r}_* \in \mathfrak{R}$ with $|\mathfrak{r}_*| = P - 1$ there exists a positive finite constant $c$ (possibly depending on $\mathfrak{r}_*$) such
that for every $\theta \in M_{\mathfrak{r}_*}$ which has exactly $P - 1$ non-zero coordinates

$$\lim_{n\to\infty} P_{n,\theta,\sigma}\bigl(\{\hat{\mathfrak{r}} = \mathfrak{r}_{full}\} \,\triangle\, \{|T_{\mathfrak{r}_*}| > c\}\bigr) = \lim_{n\to\infty} P_{n,\theta,\sigma}\bigl(\{\hat{\mathfrak{r}} = \mathfrak{r}_*\} \,\triangle\, \{|T_{\mathfrak{r}_*}| \le c\}\bigr) = 0 \tag{24}$$

holds for every $0 < \sigma < \infty$. Here $\triangle$ denotes the symmetric difference operator and $T_{\mathfrak{r}_*}$ represents the usual
t-statistic for testing the hypothesis $\theta_{i(\mathfrak{r}_*)} = 0$ in the full model, where $i(\mathfrak{r}_*)$ denotes the index of the unique
coordinate of $\mathfrak{r}_*$ that equals zero.

The above condition is quite natural for the following reason: For $\theta \in M_{\mathfrak{r}_*}$ with exactly $P - 1$ non-zero
coordinates, every reasonable model selection procedure will - with probability approaching unity - decide
only between $M_{\mathfrak{r}_*}$ and $M_{\mathfrak{r}_{full}}$; it is then quite natural that this decision will be based (at least asymptotically)
on the likelihood ratio between these two models, which in turn boils down to the t-statistic. As will be
shown below, condition (24) holds in particular for AIC-like procedures.

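Let $A$ be a non-stochastic $k \times P$ matrix of full row rank $k$, $1 \le k \le P$, as in Section 2.1. We then consider
the cdf

$$K_{n,\theta,\sigma}(t) = P_{n,\theta,\sigma}\bigl(\sqrt{n}\,A(\bar\theta - \theta) \le t\bigr) \qquad (t \in \mathbb{R}^k) \tag{25}$$

of a linear transformation of the post-model-selection estimator $\bar\theta$ obtained from the model selection procedure
$\hat{\mathfrak{r}}$, i.e.,

$$\bar\theta = \sum_{\mathfrak{r} \in \mathfrak{R}} \tilde\theta(\mathfrak{r})\,\mathbf{1}(\hat{\mathfrak{r}} = \mathfrak{r}),$$

where the $P \times 1$ vector $\tilde\theta(\mathfrak{r})$ represents the restricted least-squares estimator obtained from model $M_{\mathfrak{r}}$, with
the convention that $\tilde\theta(\mathfrak{r}) = 0 \in \mathbb{R}^P$ in case $\mathfrak{r} = (0, \ldots, 0)$. We then obtain the following result for estimation
of $K_{n,\theta,\sigma}(t)$ at a fixed value of the argument $t$ which parallels the corresponding 'impossibility' result in
Theorem 2.3.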
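
The construction of the post-model-selection estimator as a sum of restricted least-squares estimators weighted by selection indicators can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the selector is passed in as an arbitrary callable, and the data are noise-free so that the result can be checked by eye.

```python
import numpy as np

def restricted_ls(y, X, r):
    """Restricted least-squares estimator for the submodel indexed by the 0-1 tuple r.

    Coordinates with r_i = 0 are set to zero; r = (0, ..., 0) yields the zero vector.
    """
    r = np.asarray(r, dtype=bool)
    theta = np.zeros(X.shape[1])
    if r.any():
        theta[r] = np.linalg.lstsq(X[:, r], y, rcond=None)[0]
    return theta

def post_selection_estimate(y, X, candidates, select):
    """Evaluate the sum over candidates of theta(r) * 1(r_hat = r).

    `select` is any user-supplied model selection rule (an assumption of this sketch).
    """
    r_hat = select(y, X, candidates)
    return restricted_ls(y, X, r_hat), r_hat

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X[:, 0] - 2.0 * X[:, 2]                       # noise-free response for illustration
candidates = [(1, 1, 1), (1, 0, 1), (0, 0, 0)]    # a non-nested candidate set is allowed

def pick_second(y, X, cands):
    return cands[1]                               # placeholder selector, not a real criterion

theta_bar, r_hat = post_selection_estimate(y, X, candidates, pick_second)
print(r_hat, theta_bar)
```

Only the indicator of the selected model is non-zero, so evaluating the sum amounts to computing the restricted estimator for the selected index.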

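Theorem 3.1 Let $\mathfrak{r}_* \in \mathfrak{R}$ satisfy $|\mathfrak{r}_*| = P - 1$, and let $i(\mathfrak{r}_*)$ denote the index of the unique coordinate of
$\mathfrak{r}_*$ that equals zero; furthermore, let $c$ be the constant in (24) corresponding to $\mathfrak{r}_*$. Suppose that $A\tilde\theta(\mathfrak{r}_{full})$
and $\tilde\theta_{i(\mathfrak{r}_*)}(\mathfrak{r}_{full})$ are asymptotically correlated, i.e., $AQ^{-1}e_{i(\mathfrak{r}_*)} \neq 0$, where $e_{i(\mathfrak{r}_*)}$ denotes the $i(\mathfrak{r}_*)$-th standard
basis vector in $\mathbb{R}^P$. Then for every $\theta \in M_{\mathfrak{r}_*}$ which has exactly $P - 1$ non-zero coordinates, for every $\sigma$,
$0 < \sigma < \infty$, and for every $t \in \mathbb{R}^k$ the following holds: There exist $\delta_0 > 0$ and $\rho_0$, $0 < \rho_0 < \infty$, such that any
estimator $\hat K_n(t)$ of $K_{n,\theta,\sigma}(t)$ satisfying

$$P_{n,\theta,\sigma}\bigl(|\hat K_n(t) - K_{n,\theta,\sigma}(t)| > \delta\bigr) \xrightarrow{n\to\infty} 0 \tag{26}$$

for each $\delta > 0$ (in particular, every estimator that is consistent) also satisfies

$$\sup_{\substack{\vartheta \in \mathbb{R}^P \\ \|\vartheta-\theta\| < \rho_0/\sqrt{n}}} P_{n,\vartheta,\sigma}\bigl(|\hat K_n(t) - K_{n,\vartheta,\sigma}(t)| > \delta_0\bigr) \xrightarrow{n\to\infty} 1. \tag{27}$$

The constants $\delta_0$ and $\rho_0$ may be chosen in such a way that they depend only on $t$, $Q$, $A$, $\sigma$, and $c$. Moreover,

$$\liminf_{n\to\infty} \inf_{\hat K_n(t)} \sup_{\substack{\vartheta \in \mathbb{R}^P \\ \|\vartheta-\theta\| < \rho_0/\sqrt{n}}} P_{n,\vartheta,\sigma}\bigl(|\hat K_n(t) - K_{n,\vartheta,\sigma}(t)| > \delta_0\bigr) > 0 \tag{28}$$

and

$$\sup_{\delta>0} \liminf_{n\to\infty} \inf_{\hat K_n(t)} \sup_{\substack{\vartheta \in \mathbb{R}^P \\ \|\vartheta-\theta\| < \rho_0/\sqrt{n}}} P_{n,\vartheta,\sigma}\bigl(|\hat K_n(t) - K_{n,\vartheta,\sigma}(t)| > \delta\bigr) \ge 1/2 \tag{29}$$

hold, where the infima in (28) and (29) extend over all estimators $\hat K_n(t)$ of $K_{n,\theta,\sigma}(t)$.
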
The basic condition (24) on the model selection procedure employed in the above result will certainly
hold for any hypothesis testing procedure that (i) asymptotically selects only correct models, (ii) employs a
likelihood ratio test (or an asymptotically equivalent test) for testing $M_{\mathfrak{r}_{full}}$ versus smaller models (at least
versus the models $M_{\mathfrak{r}_*}$ with $\mathfrak{r}_*$ as in condition (24)), and (iii) uses a critical value for the likelihood ratio
test that converges to a finite positive constant. In particular, this applies to usual thresholding procedures
as well as to a variant of the 'general-to-specific' procedure discussed in Section 2 where the error variance
in the construction of the test statistic for hypothesis $H_0^p$ is estimated from the fitted model $M_p$ rather
than from the overall model. We next verify condition (24) for AIC-like procedures. Let $RSS(\mathfrak{r})$ denote the
residual sum of squares from the regression employing model $M_{\mathfrak{r}}$ and set

$$IC(\mathfrak{r}) = \log\bigl(RSS(\mathfrak{r})\bigr) + |\mathfrak{r}|\,T_n/n \tag{30}$$

where $T_n \ge 0$ denotes a sequence of real numbers satisfying $\lim_{n\to\infty} T_n = T$ and $T$ is a positive real number.
Of course, $IC(\mathfrak{r}) = AIC(\mathfrak{r})$ if $T_n = 2$. The model selection procedure $\hat{\mathfrak{r}}_{IC}$ is then defined as a minimizer
(more precisely, as a measurable selection from the set of minimizers) of $IC(\mathfrak{r})$ over $\mathfrak{R}$. It is well-known that
the probability that $\hat{\mathfrak{r}}_{IC}$ selects an incorrect model converges to zero. Hence, elementary calculations show
that condition (24) is satisfied for $c = T^{1/2}$.
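
The 'elementary calculations' can be illustrated numerically. A minimal sketch, assuming the standard least-squares identity $RSS(\mathfrak{r}_*)/RSS(\mathfrak{r}_{full}) = 1 + T_{\mathfrak{r}_*}^2/(n-P)$: comparing $IC(\mathfrak{r}_{full})$ with $IC(\mathfrak{r}_*)$ is then the same as comparing $|T_{\mathfrak{r}_*}|$ with the finite-sample threshold $\sqrt{(n-P)(e^{T_n/n}-1)}$, which tends to $T^{1/2}$. All function names below are hypothetical, and the candidate set is restricted to the two models in question.

```python
import numpy as np

def rss(y, X, r):
    """Residual sum of squares of the regression using the columns with r_i = 1."""
    r = np.asarray(r, dtype=bool)
    if not r.any():
        return float(y @ y)
    coef = np.linalg.lstsq(X[:, r], y, rcond=None)[0]
    resid = y - X[:, r] @ coef
    return float(resid @ resid)

def ic(y, X, r, T_n=2.0):
    """Criterion (30): log(RSS(r)) + |r| T_n / n; T_n = 2 corresponds to AIC."""
    return np.log(rss(y, X, r)) + sum(r) * T_n / len(y)

def select_ic(y, X, candidates, T_n=2.0):
    return min(candidates, key=lambda r: ic(y, X, r, T_n))

def t_stat_full(y, X, i):
    """Usual t-statistic for theta_i = 0 in the full model."""
    n, P = X.shape
    gram_inv = np.linalg.inv(X.T @ X)
    theta = gram_inv @ X.T @ y
    resid = y - X @ theta
    s2 = resid @ resid / (n - P)
    return theta[i] / np.sqrt(s2 * gram_inv[i, i])

def exact_threshold(n, P, T_n=2.0):
    """Finite-sample cut-off implied by (30) for |t|; tends to sqrt(T) as n grows."""
    return np.sqrt((n - P) * np.expm1(T_n / n))

rng = np.random.default_rng(1)
n, P = 200, 3
r_full, r_star = (1, 1, 1), (1, 1, 0)
agree = []
for _ in range(200):
    X = rng.standard_normal((n, P))
    y = X @ np.array([1.0, -1.0, 0.0]) + rng.standard_normal(n)   # theta lies in M_{r_*}
    picks_full = select_ic(y, X, [r_full, r_star]) == r_full
    agree.append(picks_full == (abs(t_stat_full(y, X, 2)) > exact_threshold(n, P)))

print(all(agree), exact_threshold(n, P), np.sqrt(2.0))
```

In this two-model setting the events $\{\hat{\mathfrak{r}}_{IC} = \mathfrak{r}_{full}\}$ and $\{|T_{\mathfrak{r}_*}| > \text{threshold}\}$ coincide exactly; with a larger candidate set they coincide only up to the probability of selecting an incorrect model, which is where the consistency argument in the text enters.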

The analysis of post-model-selection estimators based on AIC-like model selection procedures given in this
section proceeded by bringing this case under the umbrella of the results obtained in Section 2. Verification of
condition (24) is the key that enables this approach. A complete analysis of post-model-selection estimators
based on AIC-like model selection procedures, similar to the analysis in Section 2 for the 'general-to-specific'
model selection procedure, is certainly possible but requires a direct and detailed analysis of the distribution
of this post-model-selection estimator. [Even the mild condition that $\mathfrak{R}$ contains $\mathfrak{r}_{full}$ and also at least one
element $\mathfrak{r}_*$ satisfying $|\mathfrak{r}_*| = P - 1$ can then be relaxed in such an analysis.] We furthermore note that in the
special case where $\mathfrak{R} = \{\mathfrak{r}_{full}, \mathfrak{r}_*\}$ and an AIC-like model selection procedure as in (30) is used, the results
in the above theorem in fact hold for all $\theta \in M_{\mathfrak{r}_*}$.

4 Remarks and Extensions 

Remark 4.1 Although not emphasized in the notation, all results in the paper also hold if the elements of
the design matrix $X$ depend on sample size. Furthermore, all results are expressed solely in terms of the
distributions $P_{n,\theta,\sigma}(\cdot)$ of $Y$, and hence they also apply if the elements of $Y$ depend on sample size, including
the case where the random vectors $Y$ are defined on different probability spaces for different sample sizes.

Remark 4.2 The model selection procedure considered in Section 2 is based on a sequence of tests which
use critical values $c_p$ that do not depend on sample size and satisfy $0 < c_p < \infty$ for $\mathcal{O} < p \le P$. If these
critical values are allowed to depend on sample size such that they now satisfy $c_{n,p} \to c_{\infty,p}$ as $n \to \infty$ with
$0 < c_{\infty,p} < \infty$ for $\mathcal{O} < p \le P$, the results in Leeb and Potscher (2003) as well as in Leeb (2005, 2006)
continue to hold; see Remark 6.2(i) in Leeb and Potscher (2003) and Remark 6.1(h) in Leeb (2005). As a
consequence, the results in the present paper can also be extended to this case quite easily.

Remark 4.3 The 'impossibility' results given in Theorems 2.3 and 3.1 (as well as the variants thereof
discussed in the subsequent Remarks 4.4-4.7) also hold for the class of all randomized estimators (with
$P^*_{n,\theta,\sigma}$ replacing $P_{n,\theta,\sigma}$ in those results, where $P^*_{n,\theta,\sigma}$ denotes the distribution of the randomized sample).
This follows immediately from Lemma 3.6 and the attending discussion in Leeb and Potscher (2006a).

Remark 4.4 a. Let $\psi_{n,\theta,\sigma}$ denote the expectation of $\tilde\theta$ under $P_{n,\theta,\sigma}$, and consider the cdf $H_{n,\theta,\sigma}(t) =
P_{n,\theta,\sigma}(\sqrt{n}A(\tilde\theta - \psi_{n,\theta,\sigma}) \le t)$. Results for the cdf $H_{n,\theta,\sigma}$ quite similar to the results for $G_{n,\theta,\sigma}$ obtained
in the present paper can be established. A similar remark applies to the post-model-selection estimator
$\bar\theta$ considered in Section 3.

b. In Leeb (2006) also the cdf $G^*_{n,\theta,\sigma}$ is analyzed, which corresponds to a (typically infeasible) model
selection procedure that makes use of knowledge of $\sigma$. Results completely analogous to the ones in the
present paper can also be obtained for this cdf.

Remark 4.5 Results similar to the ones in Section 2.2.2 can also be obtained for estimation of the asymptotic
cdf $G_{\infty,\theta,\sigma}(t)$ (or of the asymptotic cdfs corresponding to the variants discussed in the previous remark).
Since these results are of limited interest, we omit them. In particular, note that an 'impossibility' result
for estimation of $G_{\infty,\theta,\sigma}(t)$ per se does not imply a corresponding 'impossibility' result for estimation of
$G_{n,\theta,\sigma}(t)$, since $G_{n,\theta,\sigma}(t)$ does in general not converge uniformly to $G_{\infty,\theta,\sigma}(t)$ over the relevant subsets in
the parameter space; cf. Appendix B. [An analogous remark applies to the model selection procedures
considered in Section 3.]

Remark 4.6 Let $\pi_{n,\theta,\sigma}(p)$ denote the model selection probability $P_{n,\theta,\sigma}(\hat p = p)$, $\mathcal{O} \le p \le P$, corresponding
to the model selection procedure discussed in Section 2. The finite-sample properties and the large-sample
limit behavior of these quantities are thoroughly analyzed in Leeb (2006); cf. also Leeb and Potscher (2003).
For these model selection probabilities the following results can be established which we discuss here only
briefly:

a. The model selection probabilities $\pi_{n,\theta,\sigma}(p)$ converge to well-defined large-sample limits which we denote
by $\pi_{\infty,\theta,\sigma}(p)$. Similar as in Proposition B.1 in Appendix B, the convergence of $\pi_{n,\theta,\sigma}(p)$ to $\pi_{\infty,\theta,\sigma}(p)$
is non-uniform w.r.t. $\theta$. [For the case $\mathcal{O} = 0$, this phenomenon is described in Corollary 5.6 of Leeb
and Potscher (2003).]

b. The model selection probabilities $\pi_{n,\theta,\sigma}(p)$ can be estimated consistently. However, uniformly consistent
estimation is again not possible. A similar remark applies to the large-sample limits $\pi_{\infty,\theta,\sigma}(p)$.

Remark 4.7 'Impossibility' results similar to the ones given in Theorems 2.3 and 3.1 for the cdf can also be
obtained for other characteristics of the distribution of a linear function of a post-model-selection estimator
like the mean-squared error or the bias of $\sqrt{n}A\tilde\theta$.

5 On the Scope of the Impossibility Results 

The non-uniformity phenomenon described, e.g., in (20) of Theorem 2.3 is caused by a mechanism that can
informally be described as follows. Under the assumptions of that theorem, one can find an appropriate $\theta$
and an appropriate sequence $\vartheta_n = \theta + \gamma/\sqrt{n}$ exhibiting two crucial properties:

a. The probability measures $P_{n,\vartheta_n,\sigma}$ corresponding to $\vartheta_n$ are 'close' to the measures $P_{n,\theta,\sigma}$ corresponding
to $\theta$, in the sense of contiguity. This entails that an estimator that converges to some limit in probability
under $P_{n,\theta,\sigma}$ converges to the same limit also under $P_{n,\vartheta_n,\sigma}$.

b. For given $t$, the estimands $G_{n,\vartheta_n,\sigma}(t)$ corresponding to $\vartheta_n$ are 'far away' from the estimands $G_{n,\theta,\sigma}(t)$
corresponding to $\theta$, in the sense that $G_{n,\vartheta_n,\sigma}(t)$ and $G_{n,\theta,\sigma}(t)$ converge to different limits, i.e.,
$G_{\infty,\theta,\sigma,0}(t)$ is different from $G_{\infty,\theta,\sigma,\gamma}(t)$.

In view of Property a, an estimator $\hat G_n(t)$ satisfying $\hat G_n(t) - G_{n,\theta,\sigma}(t) \to 0$ in probability under $P_{n,\theta,\sigma}$ also
satisfies $\hat G_n(t) - G_{n,\theta,\sigma}(t) \to 0$ in probability under $P_{n,\vartheta_n,\sigma}$. In view of Property b, such an estimator $\hat G_n(t)$
is hence 'far away' from the estimand $G_{n,\vartheta_n,\sigma}(t)$ with high probability under $P_{n,\vartheta_n,\sigma}$. In other words, an
estimator that is close to $G_{n,\theta,\sigma}(t)$ under $P_{n,\theta,\sigma}$ must be far away from $G_{n,\vartheta_n,\sigma}(t)$ under $P_{n,\vartheta_n,\sigma}$. Formalized
and refined, this argument leads to (20) and, as a consequence, to the non-existence of uniformly consistent
estimators for $G_{n,\theta,\sigma}(t)$. [There are a number of technical details in this formalization process that need
careful attention in order to obtain the results in their full strength as given in Sections 2 and 3.]
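
Property b can be made tangible by a small Monte Carlo experiment. The sketch below is a simplification and not the setting of Theorem 2.3 itself: it assumes two regressors with $X'X/n = Q$ exactly, a known error variance, a single pre-test of $\theta_2 = 0$ with critical value $c$, and $A = (1, 0)$. Under these assumptions the law of $\sqrt{n}(\tilde\theta_1 - \theta_1)$ depends on $\theta_2$ only through $\gamma = \sqrt{n}\,\theta_2$, so $n$ does not appear in the code. The function name and all parameter values are hypothetical.

```python
import numpy as np

def pretest_cdf_at(t, gamma, rho, c=1.96, sigma=1.0, draws=200_000, seed=0):
    """Monte Carlo value of P(sqrt(n)(theta1_tilde - theta1) <= t) when sqrt(n) theta2 = gamma.

    Simplifying assumptions: known sigma, Q = [[1, rho], [rho, 1]], one pre-test on theta2.
    """
    rng = np.random.default_rng(seed)
    q_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    # Scaled estimation error of the unrestricted least-squares estimator.
    z = rng.multivariate_normal(np.zeros(2), sigma**2 * q_inv, size=draws)
    z1, z2 = z[:, 0], z[:, 1]
    t_stat = (z2 + gamma) / (sigma * np.sqrt(q_inv[1, 1]))
    keep_full = np.abs(t_stat) > c
    # The restricted estimator of theta1 equals theta1_hat + rho * theta2_hat.
    err = np.where(keep_full, z1, z1 + rho * (z2 + gamma))
    return float(np.mean(err <= t))

cdf_at_theta = pretest_cdf_at(0.0, gamma=0.0, rho=0.7)
cdf_at_local = pretest_cdf_at(0.0, gamma=2.0, rho=0.7)
cdf_orth_0 = pretest_cdf_at(0.0, gamma=0.0, rho=0.0)
cdf_orth_2 = pretest_cdf_at(0.0, gamma=2.0, rho=0.0)
print(cdf_at_theta, cdf_at_local, cdf_orth_0, cdf_orth_2)
```

With correlated regressors the two cdf values differ markedly although the two parameter values are only $\gamma/\sqrt{n}$ apart; with orthogonal regressors they agree, in line with the 'uncorrelatedness' case of Proposition 2.5.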

The above informal argument that derives (20) from Properties a and b can be refined and formalized in 
a much more general and abstract framework, see Section 3 of Leeb and Potscher (2006a) and the references 
therein. That paper also provides a general framework for deriving results like (21) and (22) of Theorem 
2.3. The mechanism leading to such lower bounds is similar to the one outlined above, where for some of 
the results the concept of contiguity of the probability measures involved has to be replaced by closeness of 
these measures in total variation distance. We use the results in Section 3 of Leeb and Potscher (2006a) to 
formally convert Properties a and b into the 'impossibility' results of the present paper; cf. Appendix C. 

Verifying the aforementioned Property a in the context of the present paper is straightforward because 
we consider a Gaussian linear model. What is technically more challenging and requires some work is the 
verification of Property b; this is done in Appendix A inter alia and rests on results of Leeb (2002, 2005, 
2006). 

Two important observations on Properties a and b are in order: First, Property a is typically satisfied 
in general parametric models under standard regularity conditions; e.g., it is satisfied whenever the model is 
locally asymptotically normal. Second, Property b relies on limiting properties only and not on the finite- 
sample structure of the underlying statistical model. Now, the limit distributions of post-model-selection 
estimators in sufficiently regular parametric or semi-parametric models are typically the same as the limiting 
distributions of the corresponding post-model-selection estimators in a Gaussian linear model (see, e.g., Sen 
(1979), Potscher (1991), Nickl (2003), or Hjort and Claeskens (2003)). Hence, establishing Property b for 
the Gaussian linear model then typically establishes the same result for a large class of general parametric 
or semi-parametric models. 1 For example, Property b can be verified for a large class of pre-test estimators 
in sufficiently regular parametric models by arguing as in Appendix A and using the results of Nickl (2003) 
to reduce to the Gaussian linear case. Hence, the impossibility result given in Theorem 2.3 can be extended 

1 Some care has to be taken here. In the Gaussian linear case the finite-sample cdfs converge at every value of the argument t,
cf. Proposition 2.1. In a general parametric model, sometimes the asymptotic results (e.g., Hjort and Claeskens (2003, Theorem
4.1)) only guarantee weak convergence. Hence, to ensure convergence of the relevant cdfs at a given argument t as required
in Property b, additional considerations have to be employed. [This is, however, of no concern in the context discussed in the
next but one paragraph in this section.]
to more general parametric and semiparametric models with ease. The fact that we use a Gaussian linear
model for the analysis in the present paper is a matter of convenience rather than a necessity.

The non-uniformity results in Theorem 2.3 are for (conservative) 'general-to-specific' model selection 
from a nested family of models. Theorem 3.1 extends this to more general (conservative) model selection 
procedures (including AIC and related procedures) and to more general families of models. The proof of 
Theorem 3.1 proceeds by reducing the problem to one where only two nested models are considered, and then
appealing to the results of Theorem 2.3. The condition on the model selection procedures that enables this
reduction is condition (24). It is apparent from the discussion in Section 3 that this condition is satisfied for 
many model selection procedures. Furthermore, for the same reasons as given in the preceding paragraph, 
also Theorem 3.1 can easily be extended to sufficiently regular parametric and semi-parametric models. 

The 'impossibility' results in the present paper are formulated for estimating $G_{n,\theta,\sigma}(t)$ for a given value
of $t$. Suppose that we are now asking the question whether the cdf $G_{n,\theta,\sigma}(\cdot)$ viewed as a function can be
estimated uniformly consistently, where consistency is relative to a metric that metrizes weak convergence. 2
Using a similar reasoning as above (which can again be made formal by using, e.g., Lemma 3.1 in Leeb and
Potscher (2006a)) the key step now is to show that the function $G_{\infty,\theta,\sigma,0}(\cdot)$ is different from the function
$G_{\infty,\theta,\sigma,\gamma}(\cdot)$. Obviously, it is a much simpler problem to find a $\gamma$ such that the functions $G_{\infty,\theta,\sigma,0}(\cdot)$ and
$G_{\infty,\theta,\sigma,\gamma}(\cdot)$ differ, than to find a $\gamma$ such that the values $G_{\infty,\theta,\sigma,0}(t)$ and $G_{\infty,\theta,\sigma,\gamma}(t)$ for a given $t$ differ.
Certainly, having solved the latter problem in Appendix A, this also provides an answer to the former. This
then immediately delivers the desired 'impossibility' result. [We note that in some special cases simpler
arguments than the ones used in Appendix A can be employed to solve the former problem: For example,
in case $A = I$ the functions $G_{\infty,\theta,\sigma,0}(\cdot)$ and $G_{\infty,\theta,\sigma,\gamma}(\cdot)$ can each be shown to be convex combinations of
cdfs that are concentrated on subspaces of different dimensions. This can be exploited to establish without
much difficulty that the functions $G_{\infty,\theta,\sigma,0}(\cdot)$ and $G_{\infty,\theta,\sigma,\gamma}(\cdot)$ differ. For purpose of comparison we note that
for general $A$ the distributions $G_{\infty,\theta,\sigma,0}$ and $G_{\infty,\theta,\sigma,\gamma}$ can both be absolutely continuous w.r.t. Lebesgue
measure, not allowing one to use this simple argument.] Again the discussion in this paragraph extends to
more general parametric and semiparametric models without difficulty.

The present paper, including the discussion in this section, has focussed on conservative model selection 
procedures. However, the discussion should make it clear that similar 'impossibility' results plague consistent 
model selection. Section 2.3 in Leeb and Potscher (2006a) in fact gives such an 'impossibility' result in a 
simple case. 

We close with the following observations. Verification of Property b, whether it is for $G_{\infty,\theta,\sigma,0}(t)$ and
$G_{\infty,\theta,\sigma,\gamma}(t)$ (for given $t$) or for $G_{\infty,\theta,\sigma,0}(\cdot)$ and $G_{\infty,\theta,\sigma,\gamma}(\cdot)$, shows that the post-model-selection estimator $A\tilde\theta$
is a so-called non-regular estimator for $A\theta$: Consider an estimator $\hat\beta$ in a parametric model $\{P_{n,\beta} : \beta \in B\}$
where the parameter space $B$ is an open subset of Euclidean space $\mathbb{R}^d$. Suppose $\hat\beta$, properly scaled and
centered, has a limit distribution under local alternatives, in the sense that $\sqrt{n}(\hat\beta - (\beta + \gamma/\sqrt{n}))$ converges
in law under $P_{n,\beta+\gamma/\sqrt{n}}$ to a limit distribution $L_{\infty,\beta,\gamma}(\cdot)$ for every $\gamma$. The estimator $\hat\beta$ is called regular if for
every $\beta$ the limit distribution $L_{\infty,\beta,\gamma}(\cdot)$ does not depend on $\gamma$; cf. van der Vaart (1998, Section 8.5). Suppose
now that the model is, e.g., locally asymptotically normal (hence the contiguity property in Property a is
satisfied). The informal argument outlined at the beginning of this section (and which is formalized in
Lemma 3.1 of Leeb and Potscher (2006a)) then in fact shows that the cdf of any non-regular estimator

2 Or, in fact, any metric w.r.t. which the relevant cdfs converge. 
can not be estimated uniformly consistently (where consistency is relative to any metric that metrizes weak
convergence).

6 Conclusions 

Despite the fact that we have shown that consistent estimators for the distribution of a post-model-selection
estimator can be constructed with relative ease, we have also demonstrated that no estimator of this distribution
can have satisfactory performance (locally) uniformly in the parameter space, even asymptotically.
In particular, no (locally) uniformly consistent estimator of this distribution exists. Hence, the answer to 
the question posed in the title has to be negative. The results in the present paper also cover the case of 
linear functions (e.g., predictors) of the post-model-selection estimator. 

We would like to stress here that resampling procedures like, e.g., the bootstrap or subsampling, do 
not solve the problem at all. First note that standard bootstrap techniques will typically not even provide 
consistent estimators of the finite-sample distribution of the post-model-selection estimator, as the bootstrap 
can be shown to stay random in the limit (Kulperger and Ahmed (1992), Knight (1999, Example 3)) 3 . 
Basically the only way one can coerce the bootstrap into delivering a consistent estimator is to resample from 
a model that has been selected by an auxiliary consistent model selection procedure. The consistent estimator 
constructed in Section 2.2.1 is in fact of this form. In contrast to the standard bootstrap, subsampling will 
typically deliver consistent estimators. However, the 'impossibility' results given in this paper apply to any 
estimator (including randomized estimators) of the cdf of a post-model-selection estimator. Hence, also any 
resampling-based estimator suffers from the non-uniformity defects described in Theorems 2.3 and 3.1; cf.
also Remark 4.3. 

The 'impossibility' results in Theorems 2.3 and 3.1 are derived in the framework of a normal linear 
regression model (and a fortiori these results continue to hold in any model which includes the normal 
linear regression model as a special case), but this is more a matter of convenience than anything else: As 
discussed in Section 5, similar results can be obtained in general statistical models allowing for nonlinearity 
or dependent data, e.g., as long as standard regularity conditions for maximum likelihood theory are satisfied. 

The results in the present paper are derived for a large class of conservative model selection procedures 
(i.e., procedures that select overparameterized models with positive probability asymptotically) including 
Akaike's AIC and typical 'general-to-specific' hypothesis testing procedures. For consistent model selection 
procedures - like BIC or testing procedures with suitably diverging critical values $c_p$ (cf. Bauer, Potscher, and
Hackl (1988)) - the (pointwise) asymptotic distribution is always normal. [This is elementary, cf. Lemma 1 
in Potscher (1991).] However, as discussed at length in Leeb and Potscher (2005a), this asymptotic nor- 
mality result paints a misleading picture of the finite sample distribution which can be far from a normal, 
the convergence of the finite-sample distribution to the asymptotic normal distribution not being uniform. 
'Impossibility' results similar to the ones presented here can also be obtained for post-model-selection
estimators based on consistent model selection procedures. These will be discussed in detail elsewhere. For a
simple special case such an 'impossibility' result is given in Section 2.3 of Leeb and Potscher (2006a).

3 Brownstone (1990) claims the validity of a bootstrap procedure that is based on a conservative model selection procedure
in a linear regression model. Kilian (1998) makes a similar claim in the context of autoregressive models selected by a
conservative model selection procedure. Also Hansen (2003) contains such a claim for a stationary bootstrap procedure based on a
conservative model selection procedure. The above discussion intimates that these claims are at least unsubstantiated.

The 'impossibility' of estimating the distribution of the post-model-selection estimator does not per se
preclude the possibility of conducting valid inference after model selection, a topic that deserves further 
study. However, it certainly makes this a more challenging task. 

A Auxiliary Lemmas 

Lemma A.1 Let $Z$ be a random vector with values in $\mathbb{R}^k$ and let $W$ be a univariate standard Gaussian
random variable independent of $Z$. Furthermore, let $C \in \mathbb{R}^k$ and $\tau > 0$. Then

$$P(Z \le Cx)\, P(|W - x| < \tau) + P(Z \le CW,\ |W - x| \ge \tau) \tag{31}$$

is constant as a function of $x \in \mathbb{R}$ if and only if $C = 0$ or $P(Z \le Cx) = 0$ for each $x \in \mathbb{R}$.
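
As a numerical illustration of Lemma A.1 (not part of the original argument), the following sketch evaluates expression (31) in the scalar case $k = 1$. It assumes, purely for illustration, that $Z$ is standard normal; the function name `expression_31` and the midpoint-rule grid are hypothetical choices. For $C = 0$ the expression should be (numerically) constant and equal to $P(Z \le 0) = 1/2$, whereas for $C = 1$ it should vary with $x$.

```python
import math


def phi(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z: float) -> float:
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expression_31(x: float, C: float, tau: float,
                  n_grid: int = 20000, half_width: float = 10.0) -> float:
    """Expression (31) for k = 1, assuming Z ~ N(0,1) independent of W ~ N(0,1).

    First term:  P(Z <= C x) * P(|W - x| < tau).
    Second term: P(Z <= C W, |W - x| >= tau), computed by a midpoint rule
                 over w in [-half_width, half_width].
    """
    first = Phi(C * x) * (Phi(x + tau) - Phi(x - tau))
    h = 2.0 * half_width / n_grid
    second = 0.0
    for i in range(n_grid):
        w = -half_width + (i + 0.5) * h
        if abs(w - x) >= tau:
            second += Phi(C * w) * phi(w) * h
    return first + second


if __name__ == "__main__":
    for C in (0.0, 1.0):
        values = [expression_31(x, C, tau=1.0) for x in (0.0, 1.0, 2.0)]
        print(C, [round(v, 4) for v in values])
```

The midpoint rule is only accurate up to the grid spacing near the cut-offs $|w - x| = \tau$, so small deviations from exact constancy in the case $C = 0$ are numerical artefacts.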

Proof of Lemma A.1: Suppose $C = 0$ holds. Using independence of $Z$ and $W$ it is then easy to see that
(31) reduces to $P(Z \le 0)$, which is constant in $x$. If $P(Z \le Cx) = 0$ for every $x \in \mathbb{R}$, then $P(Z \le CW) = 0$,
and hence (31) is again constant, namely equal to zero.

To prove the converse, assume that (31) is constant in $x \in \mathbb{R}$. Letting $x \to \infty$, we see that (31) must be
equal to $P(Z \le CW)$. This entails that

$$P(Z \le Cx)\, P(|W - x| < \tau) = P(Z \le CW,\ |W - x| < \tau)$$

holds for every $x \in \mathbb{R}$. Write $F(x)$ as shorthand for $P(Z \le Cx)$, and let $\Phi(z)$ and $\phi(z)$ denote the cdf and
density of $W$, respectively. Then the expression in the above display can be written as

$$F(x)\,\bigl(\Phi(x + \tau) - \Phi(x - \tau)\bigr) = \int_{x-\tau}^{x+\tau} F(z)\,\phi(z)\,dz \qquad (x \in \mathbb{R}). \tag{32}$$

We now further assume that $C \ne 0$ and that $F(x) \ne 0$ for at least one $x \in \mathbb{R}$, and show that this leads to a
contradiction.

Consider first the case where all components of $C$ are non-negative. Since $F$ is not identically zero, it
is then, up to a scale factor, the cdf of a random variable on the real line. But then (32) cannot hold for
all $x \in \mathbb{R}$ as shown in Example 7 in Leeb (2002) (cf. also equation (7) in that paper). The case where
all components of $C$ are non-positive follows similarly by applying the above argument to $F(-x)$ and upon
observing that both $\Phi(x + \tau) - \Phi(x - \tau)$ and $\phi(x)$ are symmetric around $x = 0$.

Finally, consider the case where $C$ has at least one positive and one negative component. In this case
clearly $\lim_{x \to -\infty} F(x) = \lim_{x \to \infty} F(x) = 0$ holds. Since $F(x)$ is continuous in view of (32), we see that $F(x)$
attains its (positive) maximum at some point $x_1 \in \mathbb{R}$. Now note that (32) with $x_1$ replacing $x$ can be written
as

$$\int_{x_1-\tau}^{x_1+\tau} \bigl(F(x_1) - F(z)\bigr)\,\phi(z)\,dz = 0.$$

This immediately entails that $F(x) = F(x_1)$ for each $x \in [x_1 - \tau, x_1 + \tau]$ (because $F(x)$ is continuous and
because of the definition of $x_1$). Repeating this argument with $x_1 - \tau$ replacing $x_1$ and proceeding inductively,
we obtain that $F(x) = F(x_1)$ for each $x$ satisfying $x \le x_1 + \tau$, a contradiction with $\lim_{x \to -\infty} F(x) = 0$. □

Lemma A.2 Let $M$ and $N$ be matrices of dimension $k \times p$ and $k \times q$, respectively, such that the matrix
$(M : N)$ has rank $k$ ($k \ge 1$, $p \ge 1$, $q \ge 1$). Let $t \in \mathbb{R}^k$, and let $V$ be a random vector with values in $\mathbb{R}^p$
whose distribution assigns positive mass to every (non-empty) open subset of $\mathbb{R}^p$ (e.g., it possesses an almost
everywhere positive Lebesgue density). Set $f(x) = P(MV \le t + Nx)$, $x \in \mathbb{R}^q$. If one of the rows of $M$
consists of zeros only, then $f$ is discontinuous at some point $x_0$. More precisely, there exist $x_0 \in \mathbb{R}^q$, $z \in \mathbb{R}^q$
and a constant $c > 0$, such that $f(x_0 + \delta z) \ge c$ and $f(x_0 - \delta z) = 0$ hold for every sufficiently small $\delta > 0$.

Proof of Lemma A.2: The case where $M$ is the zero-matrix is trivial. Otherwise, let $I_0$ denote the
set of indices $i$, $1 \le i \le k$, for which the $i$-th row of $M$ is zero. Let $(M_0 : N_0)$ denote the matrix consisting
of those rows of $(M : N)$ whose index is in $I_0$, and let $(M_1 : N_1)$ denote the matrix consisting of the
remaining rows of $(M : N)$. Clearly, $M_0$ is then the zero matrix. Furthermore, note that $N_0$ has full
row-rank. Moreover, let $t_0$ denote the vector consisting of those components of $t$ whose index is in $I_0$ and
let $t_1$ denote the vector containing the remaining components. With this notation, $f(x)$ can be written as
$P(0 \le t_0 + N_0 x,\ M_1 V \le t_1 + N_1 x)$.

For vectors $\mu \in \mathbb{R}^p$ and $\eta \in \mathbb{R}^q$ to be specified in a moment, set $t^* = t + M\mu + N\eta$, and let $t_0^*$ and $t_1^*$ be
defined similarly to $t_0$ and $t_1$. Since the matrix $(M : N)$ has full rank $k$, we can choose $\mu$ and $\eta$ such that
$t_0^* = 0$ and $t_1^* > 0$. Choose $z \in \mathbb{R}^q$ such that $N_0 z > 0$, which is possible because $N_0$ has full row-rank. Set
$x_0 = \eta$. Then for every $\varepsilon \in \mathbb{R}$ we have

$$\begin{aligned}
f(x_0 + \varepsilon z) = f(\eta + \varepsilon z) &= P\bigl(MV \le t + N(\eta + \varepsilon z)\bigr) \\
&= P\bigl(0 \le t_0 + N_0(\eta + \varepsilon z),\ M_1 V \le t_1 + N_1(\eta + \varepsilon z)\bigr) \\
&= P\bigl(0 \le t_0^* + \varepsilon N_0 z,\ M_1(V + \mu) \le t_1^* + \varepsilon N_1 z\bigr) \\
&= P\bigl(0 \le \varepsilon N_0 z,\ M_1(V + \mu) \le t_1^* + \varepsilon N_1 z\bigr).
\end{aligned}$$

Since $t_1^* > 0$, we can find a $t^{**}$ such that $0 < t^{**} < t_1^* + \varepsilon N_1 z$ holds for every $\varepsilon$ with $|\varepsilon|$ small enough. If now
$\varepsilon > 0$ then

$$f(x_0 + \varepsilon z) = P\bigl(M_1(V + \mu) \le t_1^* + \varepsilon N_1 z\bigr) \ge P\bigl(M_1(V + \mu) \le t^{**}\bigr).$$

The r.h.s. in the above display is positive because $t^{**} > 0$ and because the distribution of $M_1(V + \mu)$ assigns
positive mass to any neighborhood of the origin, since the same is true for the distribution of $V + \mu$ and
since $M_1$ maps neighborhoods of zero into neighborhoods of zero. Setting $c = P(M_1(V + \mu) \le t^{**})/2$, we
have $f(x_0 + \varepsilon z) \ge c > 0$ for each sufficiently small $\varepsilon > 0$. Furthermore, for $\varepsilon < 0$ we have $f(x_0 + \varepsilon z) = 0$,
since $f(x_0 + \varepsilon z) \le P(0 \le \varepsilon N_0 z) = 0$ in view of $N_0 z > 0$. □
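
The jump described in Lemma A.2 can be seen in a minimal example. The sketch below assumes, for illustration only, that $p = q = 1$ and that $V$ is standard normal (which has an everywhere positive density); the helper `f_scalar` is a hypothetical name. With $M = (0, 1)'$, $N = (1, 0)'$ and $t = (0, 0)'$ the matrix $(M : N)$ has rank $k = 2$, the first row of $M$ is zero, and $f(x) = 1\{x \ge 0\}\,\Phi(0)$, so that $x_0 = 0$ and $z = 1$ exhibit the discontinuity.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal cdf; handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def f_scalar(x: float, M, N, t) -> float:
    """f(x) = P(M V <= t + N x) for scalar V ~ N(0,1) and scalar x.

    M, N, t are sequences of length k (the k rows of the inequality system).
    Rows with M_i = 0 impose a deterministic constraint on x; the other rows
    bound V from above (M_i > 0) or from below (M_i < 0).
    """
    lo, hi = -math.inf, math.inf
    for m_i, n_i, t_i in zip(M, N, t):
        bound = t_i + n_i * x
        if m_i == 0.0:
            if bound < 0.0:  # the event {0 <= t_i + N_i x} fails
                return 0.0
        elif m_i > 0.0:
            hi = min(hi, bound / m_i)
        else:
            lo = max(lo, bound / m_i)
    if hi < lo:
        return 0.0
    return std_normal_cdf(hi) - std_normal_cdf(lo)


if __name__ == "__main__":
    M, N, t = (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)
    for delta in (1e-2, 1e-4, 1e-8):
        print(delta, f_scalar(delta, M, N, t), f_scalar(-delta, M, N, t))
```

In this toy case any constant $c \le 1/2$ serves as the constant of the lemma.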

Lemma A.3 Let $Z$ be a random vector with values in $\mathbb{R}^p$, $p \ge 1$, with a distribution that is absolutely
continuous with respect to Lebesgue measure on $\mathbb{R}^p$. Let $B$ be a $k \times p$ matrix, $k \ge 1$. Then the cdf $P(BZ \le \cdot)$
of $BZ$ is discontinuous at $t \in \mathbb{R}^k$ if and only if $P(BZ \le t) > 0$ and if for some $i_0$, $1 \le i_0 \le k$, the $i_0$-th row
of $B$ and the $i_0$-th component of $t$ are both zero, i.e., $B_{i_0,\cdot} = (0, \ldots, 0)$ and $t_{i_0} = 0$.

Proof of Lemma A.3: To establish sufficiency of the above condition, let $P(BZ \le t) > 0$, $t_{i_0} = 0$ and
$B_{i_0,\cdot} = (0, \ldots, 0)$ for some $i_0$, $1 \le i_0 \le k$. Then, of course, $P(B_{i_0,\cdot}Z = 0) = 1$. For $t_n = t - n^{-1}e_{i_0}$, where $e_{i_0}$
denotes the $i_0$-th unit vector in $\mathbb{R}^k$, we have $P(BZ \le t_n) \le P(B_{i_0,\cdot}Z \le t_{n,i_0}) = P(B_{i_0,\cdot}Z \le -1/n) = 0$ for
every $n$. Consequently, $P(BZ \le t)$ is discontinuous at $t$.

To establish necessity, we first show the following: If $t_n \in \mathbb{R}^k$ is a sequence converging to $t \in \mathbb{R}^k$ as
$n \to \infty$, then every accumulation point of the sequence $P(BZ \le t_n)$ has the form

$$P\bigl(B_{i_1,\cdot}Z \le t_{i_1}, \ldots, B_{i_m,\cdot}Z \le t_{i_m},\ B_{i_{m+1},\cdot}Z < t_{i_{m+1}}, \ldots, B_{i_k,\cdot}Z < t_{i_k}\bigr) \tag{33}$$

for some $m$, $0 \le m \le k$, and for some permutation $(i_1, \ldots, i_k)$ of $(1, \ldots, k)$. This can be seen as follows: Let $\alpha$
be an accumulation point of $P(BZ \le t_n)$. Then we may find a subsequence such that $P(BZ \le t_n)$ converges
to $\alpha$ along this subsequence. From this subsequence we may even extract a further subsequence along which
each component of the $k \times 1$ vector $t_n$ converges to the corresponding component of $t$ monotonically, that
is, either from above or from below. Without loss of generality, we may also assume that those components
which converge from below are strictly increasing. The resulting subsequence will be denoted by $n_j$ in the
sequel. Assume that the components of $t_{n_j}$ with indices $i_1, \ldots, i_m$ converge from above, while the components
with indices $i_{m+1}, \ldots, i_k$ converge from below. Now

$$P(BZ \le t_{n_j}) = \int_{z \in \mathbb{R}^k} \prod_{s=1}^{k} \mathbf{1}_{(-\infty,\, t_{n_j,s}]}(z_s)\, P_{BZ}(dz), \tag{34}$$

where $P_{BZ}$ denotes the distribution of $BZ$. The integrand in (34) now converges to
$\prod_{l=1}^{m} \mathbf{1}_{(-\infty,\, t_{i_l}]}(z_{i_l}) \prod_{l=m+1}^{k} \mathbf{1}_{(-\infty,\, t_{i_l})}(z_{i_l})$ for all $z \in \mathbb{R}^k$ as $n_j \to \infty$. The r.h.s. of (34) converges to the
expression in (33) as $n_j \to \infty$ by the Dominated Convergence Theorem, while the l.h.s. of (34) converges to
$\alpha$ by construction. This establishes the claim regarding (33).

Now suppose that $P(BZ \le t)$ is discontinuous at $t$; i.e., there exists a sequence $t_n$ converging to $t$ as
$n \to \infty$, such that $P(BZ \le t_n)$ does not converge to $P(BZ \le t)$ as $n \to \infty$. From the sequence $t_n$ we
can extract a subsequence $t_{n_s}$ along which $P(BZ \le t_{n_s})$ converges to a limit different from $P(BZ \le t)$ as
$n_s \to \infty$. As shown above, the limit has to be of the form (33) and $m < k$ has to hold. Consequently, the
limit of $P(BZ \le t_{n_s})$ is smaller than $P(BZ \le t) = P(B_{i,\cdot}Z \le t_i,\ i = 1, \ldots, k)$. The difference of $P(BZ \le t)$
and the limit of $P(BZ \le t_{n_s})$ is positive and because of (33) can be written as

$$P\bigl(B_{i_j,\cdot}Z \le t_{i_j} \text{ for each } j = 1, \ldots, k,\ B_{i_j,\cdot}Z = t_{i_j} \text{ for some } j = m+1, \ldots, k\bigr) > 0.$$

We thus see that $P(B_{i_{j_0},\cdot}Z = t_{i_{j_0}}) > 0$ for some $j_0$ satisfying $m + 1 \le j_0 \le k$. As $Z$ is absolutely
continuous with respect to Lebesgue measure on $\mathbb{R}^p$, this can only happen if $B_{i_{j_0},\cdot} = (0, \ldots, 0)$ and $t_{i_{j_0}} = 0$.

□

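Lemma A.4 Suppose that $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are asymptotically correlated, i.e., $C_\infty^{(q)} \ne 0$, for some $q$ satisfying
$\mathcal{O} < q \le P$, and let $q^*$ denote the largest $q$ with this property. Moreover let $\theta \in M_{q^*-1}$, let $\sigma$ satisfy
$0 < \sigma < \infty$, and let $t \in \mathbb{R}^k$. Then $G_{\infty,\theta,\sigma,\gamma}(t)$ is non-constant as a function of $\gamma \in M_{q^*} \backslash M_{q^*-1}$. More
precisely, there exist $\delta_0 > 0$ and $\rho_0$, $0 < \rho_0 < \infty$, such that

$$\sup_{\substack{\gamma^{(1)},\gamma^{(2)} \in M_{q^*} \backslash M_{q^*-1} \\ \|\gamma^{(i)}\| < \rho_0,\ i = 1,2}} \bigl|G_{\infty,\theta,\sigma,\gamma^{(1)}}(t) - G_{\infty,\theta,\sigma,\gamma^{(2)}}(t)\bigr| > 2\delta_0 \tag{35}$$

holds. The constants $\delta_0$ and $\rho_0$ can be chosen in such a way that they depend only on $t$, $Q$, $A$, $\sigma$, and the
critical values $c_p$ for $\mathcal{O} < p \le P$.

Lemma A.5 Suppose that $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are asymptotically correlated, i.e., $C_\infty^{(q)} \ne 0$, for some $q$ satisfying
$\mathcal{O} < q \le P$, and let $q^*$ denote the largest $q$ with this property. Suppose further that for some $p_\odot$ satisfying
$\mathcal{O} \le p_\odot < q^*$ either $p_\odot = 0$ holds or that $p_\odot > 0$ and $A[p_\odot]$ has a row of zeros. Then, for every $\theta \in M_{p_\odot}$,
every $\sigma$, $0 < \sigma < \infty$, and every $t \in \mathbb{R}^k$ the quantity $G_{\infty,\theta,\sigma,\gamma}(t)$ is discontinuous as a function of $\gamma \in M_{q^*}$.
More precisely, for each $s = \mathcal{O}, \ldots, p_\odot$, there exist vectors $\beta_*$ and $\gamma_*$ in $M_{q^*}$ and constants $\delta_* > 0$ and $\varepsilon_* > 0$
such that

$$\bigl|G_{\infty,\theta,\sigma,\beta_* + \varepsilon\gamma_*}(t) - G_{\infty,\theta,\sigma,\beta_* - \varepsilon\gamma_*}(t)\bigr| > \delta_* \tag{36}$$

holds for every $\theta$ satisfying $\max\{p_0(\theta), \mathcal{O}\} = s$ and for every $\varepsilon$ with $0 < \varepsilon < \varepsilon_*$. The quantities $\delta_*$, $\varepsilon_*$, $\beta_*$,
and $\gamma_*$ can be chosen in such a way that - besides $t$, $Q$, $A$, $\sigma$, and the critical values $c_p$ for $\mathcal{O} < p \le P$ -
they depend on $\theta$ only through $\max\{p_0(\theta), \mathcal{O}\}$.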

Before we prove the above lemmas, we provide a representation of $G_{\infty,\theta,\sigma,\gamma}(t)$ that will be useful in
the following: For $0 < p \le P$ define $Z_p = \sum_{r=1}^{p} \xi_{\infty,r}^{-2} C_\infty^{(r)} W_r$, where $C_\infty^{(r)}$ has been defined after (13) and
the random variables $W_r$ are independent normally distributed with mean zero and variances $\sigma^2 \xi_{\infty,r}^2$; for
convenience, let $Z_0$ denote the zero vector in $\mathbb{R}^k$. Observe that $Z_p$, $p > 0$, is normally distributed with
mean zero and variance-covariance matrix $\sigma^2 A[p]Q[p:p]^{-1}A[p]'$ since it has been shown in the proof of
Proposition 4.4 in Leeb and Potscher (2006b) that the asymptotic variance-covariance matrix
$\sigma^2 A[p]Q[p:p]^{-1}A[p]'$ of $\sqrt{n}A\tilde\theta(p)$ can be expressed as $\sum_{r=1}^{p} \sigma^2 \xi_{\infty,r}^{-2} C_\infty^{(r)} C_\infty^{(r)\prime}$. Also the joint distribution of $Z_p$ and the
set of variables $W_r$, $1 \le r \le P$, is normal, with the covariance vector between $Z_p$ and $W_r$ given by $\sigma^2 C_\infty^{(r)}$ in
case $r \le p$; otherwise $Z_p$ and $W_r$ are independent. Define the constants $\nu_r = \gamma_r + (Q[r:r]^{-1}Q[r:\neg r]\gamma[\neg r])_r$
for $0 < r \le P$. It is now easy to see that for $p \ge p_* = \max\{p_0(\theta), \mathcal{O}\}$ the quantity $\beta^{(p)}$ defined in
Proposition 2.1 equals $-\sum_{r=p+1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r$. [This is seen as follows: It was noted in Proposition 2.1 that
$\beta^{(p)} = \lim_{n \to \infty} \sqrt{n}A(\eta_n(p) - \theta - \gamma/\sqrt{n})$ for $p \ge p_0(\theta)$, when $\eta_n(p)$ is defined as in (9), but with $\theta + \gamma/\sqrt{n}$
replacing $\theta$. Using the representation (20) of Leeb (2005) and taking limits, the result follows if we observe
that $\sqrt{n}\,\eta_{n,r}(r) \to \nu_r$ for $r > p \ge p_0(\theta)$.] The cdf in (15) can now be written as

$$\begin{aligned}
& P\Bigl(Z_{p_*} \le t + \sum_{r=p_*+1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r\Bigr) \prod_{q=p_*+1}^{P} P\bigl(|W_q + \nu_q| < c_q \sigma \xi_{\infty,q}\bigr) \\
&\quad + \sum_{p=p_*+1}^{P} P\Bigl(Z_p \le t + \sum_{r=p+1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r,\ |W_p + \nu_p| \ge c_p \sigma \xi_{\infty,p}\Bigr) \prod_{q=p+1}^{P} P\bigl(|W_q + \nu_q| < c_q \sigma \xi_{\infty,q}\bigr).
\end{aligned} \tag{37}$$

That the terms corresponding to $p = p_*$ in (37) and (15) agree is obvious. Furthermore, for each $p > p_*$
the terms under the product sign in (37) and (15) coincide by definition of the function $\Delta_s(a, b)$. It is also
easy to see that the conditional distribution of $W_p$ given $Z_p = z$ is Gaussian with mean $b_{\infty,p}z$ and variance
$\sigma^2 \zeta_{\infty,p}^2$. Consequently, the probability of the event $\{|W_p + \nu_p| \ge c_p \sigma \xi_{\infty,p}\}$ conditional on $Z_p = z$ is given by
the integrand shown in (15). Since $Z_p$ has distribution $\Phi_{\infty,p}$ as noted above, it follows that (37) and (15)
agree.

Remark A.6 If $C_\infty^{(p)} = 0$ for $p > p_*$, then in view of the above discussion $Z_{p_*} = Z_p = Z_P$, and hence
$\Phi_{\infty,p_*} = \Phi_{\infty,p} = \Phi_{\infty,P}$ holds for all $p > p_*$. Using the independence of $W_r$, $r > p_*$, from $Z_{p_*}$, inspection of
(37) shows that $G_{\infty,\theta,\sigma,\gamma}$ reduces to $\Phi_{\infty,P}$; see also Leeb (2006, Remark 5.2).
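
To make the representation (37) concrete, the following sketch evaluates it in the simplest possible configuration. All choices here are assumptions made for illustration and are not taken from the paper: $P = 1$, $\mathcal{O} = 0$, $k = 1$, $A = 1$, $Q = 1$, $\sigma = 1$ (so that $\xi_{\infty,1} = 1$, $C_\infty^{(1)} = 1$, $Z_1 = W_1 \sim N(0,1)$ and $\nu_1 = \gamma_1$), and $p_* = 0$. Under these assumptions (37) has two terms, with $Z_0 = 0$ turning the first probability into an indicator. The function name `limit_cdf_scalar` is hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def limit_cdf_scalar(t: float, nu: float, c: float) -> float:
    """Representation (37) in the toy case P = 1, O = 0, k = 1, A = Q = sigma = 1.

    Term p = 0:  P(Z_0 <= t + nu) * P(|W + nu| <  c), with Z_0 = 0.
    Term p = 1:  P(W   <= t,        |W + nu| >= c), with W ~ N(0, 1).
    """
    prob_inside = std_normal_cdf(c - nu) - std_normal_cdf(-c - nu)
    first = (1.0 if t + nu >= 0.0 else 0.0) * prob_inside
    # {W <= t, W + nu >= c} and {W <= t, W + nu <= -c} are disjoint events.
    upper_branch = max(0.0, std_normal_cdf(t) - std_normal_cdf(c - nu))
    lower_branch = std_normal_cdf(min(t, -c - nu))
    return first + upper_branch + lower_branch


if __name__ == "__main__":
    t, c = 0.0, 2.0
    for nu in (-3.0, -1.0, -1e-9, 1e-9, 1.0, 3.0, 50.0):
        print(nu, round(limit_cdf_scalar(t, nu, c), 6))
```

In this toy case the value depends on $\nu$ (in line with the non-constancy asserted in Lemma A.4), it jumps at $\nu = -t$ (in line with Lemma A.5 in the sub-case $p_\odot = 0$), and it approaches $\Phi(t)$ as $|\nu| \to \infty$.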

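Proof of Lemma A.4: From (37) (or (15)) it follows that the map $\gamma \mapsto G_{\infty,\theta,\sigma,\gamma}(t)$ depends only on $t$,
$Q$, $A$, $\sigma$, the critical values $c_p$ for $\mathcal{O} < p \le P$, as well as on $\theta$; however, the dependence on $\theta$ is only through
$p_* = \max\{p_0(\theta), \mathcal{O}\}$. It hence suffices to find, for each possible value of $p_*$ in the range $p_* = \mathcal{O}, \ldots, q^* - 1$,
constants $0 < \rho_0 < \infty$ and $\delta_0 > 0$ such that (35) is satisfied for some (and hence all) $\theta$ returning this
particular value of $p_* = \max\{p_0(\theta), \mathcal{O}\}$. For this in turn it is sufficient to show that for every $\theta \in M_{q^*-1}$ the
quantity $G_{\infty,\theta,\sigma,\gamma}(t)$ is non-constant as a function of $\gamma \in M_{q^*} \backslash M_{q^*-1}$.

Let $\theta \in M_{q^*-1}$ and assume that $G_{\infty,\theta,\sigma,\gamma}(t)$ is constant in $\gamma \in M_{q^*} \backslash M_{q^*-1}$. Observe that, by assumption,
$C_\infty^{(q^*)}$ is non-zero while $C_\infty^{(p)} = 0$ for $p > q^*$. For $\gamma \in M_{q^*}$, we clearly have $\nu_{q^*} = \gamma_{q^*}$, and $\nu_r = 0$ for $r > q^*$.
Letting $\gamma_{q^*-1} \to \infty$ while $\gamma_{q^*}$ is held fixed, we see that $\nu_{q^*-1} \to \infty$; hence,

$$P\bigl(|W_{q^*-1} + \nu_{q^*-1}| < c_{q^*-1}\sigma\xi_{\infty,q^*-1}\bigr) \to 0.$$

It follows that (37) converges to

$$\begin{aligned}
& P\bigl(Z_{q^*-1} \le t + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)} \gamma_{q^*}\bigr)\, P\bigl(|W_{q^*} + \gamma_{q^*}| < c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \prod_{q=q^*+1}^{P} P\bigl(|W_q| < c_q\sigma\xi_{\infty,q}\bigr) \\
&\quad + P\bigl(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}| \ge c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \prod_{q=q^*+1}^{P} P\bigl(|W_q| < c_q\sigma\xi_{\infty,q}\bigr) \\
&\quad + \sum_{p=q^*+1}^{P} P\bigl(Z_p \le t,\ |W_p| \ge c_p\sigma\xi_{\infty,p}\bigr) \prod_{q=p+1}^{P} P\bigl(|W_q| < c_q\sigma\xi_{\infty,q}\bigr).
\end{aligned} \tag{38}$$

By assumption, the expression in the above display is constant in $\gamma_{q^*} \in \mathbb{R} \backslash \{0\}$. Dropping the terms that do
not depend on $\gamma_{q^*}$ and observing that $P(|W_q| < c_q\sigma\xi_{\infty,q})$ is never zero for $q > q^* > 0$, we see that

$$P\bigl(Z_{q^*-1} \le t + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)} \gamma_{q^*}\bigr)\, P\bigl(|W_{q^*} + \gamma_{q^*}| < c_{q^*}\sigma\xi_{\infty,q^*}\bigr) + P\bigl(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}| \ge c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \tag{39}$$

has to be constant in $\gamma_{q^*} \in \mathbb{R} \backslash \{0\}$. We now show that the expression in (39) is in fact constant in $\gamma_{q^*} \in \mathbb{R}$:
Observe first that $P(|W_{q^*} + \gamma_{q^*}| < c_{q^*}\sigma\xi_{\infty,q^*})$ is positive and continuous in $\gamma_{q^*} \in \mathbb{R}$; also the probability
$P(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}| \ge c_{q^*}\sigma\xi_{\infty,q^*})$ is continuous in $\gamma_{q^*} \in \mathbb{R}$ since $W_{q^*}$, being normal with mean zero and
positive variance, is absolutely continuously distributed. Concerning the remaining term in (39), we note
that $Z_{q^*-1} = MV$ where $M = [\xi_{\infty,1}^{-2} C_\infty^{(1)}, \ldots, \xi_{\infty,q^*-1}^{-2} C_\infty^{(q^*-1)}]$ and $V = (W_1, \ldots, W_{q^*-1})'$. In case no row of
$M$ is identically zero, Lemma A.3 shows that also $P(Z_{q^*-1} \le t + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)} \gamma_{q^*})$ is continuous in $\gamma_{q^*} \in \mathbb{R}$.
Hence, in this case (39) is indeed constant for all $\gamma_{q^*} \in \mathbb{R}$. In case a row of $M$ is identically zero, define
$N = \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)}$ and rewrite the probability in question as $P(MV \le t + N\gamma_{q^*})$. Note that $(M : N)$ has full
row-rank $k$, since

$$(M : N)\,\mathrm{diag}[\xi_{\infty,1}^{2}, \ldots, \xi_{\infty,q^*}^{2}]\,(M : N)' = \sum_{r=1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} C_\infty^{(r)\prime} = \sum_{r=1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} C_\infty^{(r)\prime} = AQ^{-1}A' \tag{40}$$

by definition of $q^*$ and since the latter matrix is non-singular in view of rank $A = k$. Lemma A.2 then shows
that there exists a $\gamma_{q^*}^{(0)} \in \mathbb{R}$, $z \in \{-1, 1\}$, and a constant $c > 0$ such that $P(MV \le t + N(\gamma_{q^*}^{(0)} - \delta z)) = 0$
and $P(MV \le t + N(\gamma_{q^*}^{(0)} + \delta z)) \ge c$ holds for arbitrary small $\delta > 0$. Observe that $\gamma_{q^*}^{(0)} - \delta z$ as well as
$\gamma_{q^*}^{(0)} + \delta z$ are non-zero for sufficiently small $\delta > 0$. But then (39) - being constant for $\gamma_{q^*} \in \mathbb{R} \backslash \{0\}$ - gives
the same value for $\gamma_{q^*} = \gamma_{q^*}^{(0)} - \delta z$ and $\gamma_{q^*} = \gamma_{q^*}^{(0)} + \delta z$ and all sufficiently small $\delta > 0$. Letting $\delta$ go to zero
in this equality and using the continuity properties for the second and third probability in (39) noted above
we obtain that

$$\begin{aligned}
& c\,P\bigl(|W_{q^*} + \gamma_{q^*}^{(0)}| < c_{q^*}\sigma\xi_{\infty,q^*}\bigr) + P\bigl(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}^{(0)}| \ge c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \\
&\quad \le \liminf_{\delta \downarrow 0} P\bigl(Z_{q^*-1} \le t + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)}(\gamma_{q^*}^{(0)} + \delta z)\bigr)\, P\bigl(|W_{q^*} + \gamma_{q^*}^{(0)}| < c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \\
&\qquad + P\bigl(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}^{(0)}| \ge c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \\
&\quad = \liminf_{\delta \downarrow 0} P\bigl(Z_{q^*-1} \le t + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)}(\gamma_{q^*}^{(0)} - \delta z)\bigr)\, P\bigl(|W_{q^*} + \gamma_{q^*}^{(0)}| < c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \\
&\qquad + P\bigl(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}^{(0)}| \ge c_{q^*}\sigma\xi_{\infty,q^*}\bigr) \\
&\quad = P\bigl(Z_{q^*} \le t,\ |W_{q^*} + \gamma_{q^*}^{(0)}| \ge c_{q^*}\sigma\xi_{\infty,q^*}\bigr),
\end{aligned}$$

which is impossible since $c > 0$ and $P(|W_{q^*} + \gamma_{q^*}^{(0)}| < c_{q^*}\sigma\xi_{\infty,q^*}) > 0$. Hence we have shown that (39) is
indeed constant for all $\gamma_{q^*} \in \mathbb{R}$.

Now write $Z$, $W$, $C$, $\tau$, and $x$ for $Z_{q^*-1} - t$, $-W_{q^*}/(\sigma\xi_{\infty,q^*})$, $\sigma\xi_{\infty,q^*}^{-1} C_\infty^{(q^*)}$, $c_{q^*}$, and $\gamma_{q^*}/(\sigma\xi_{\infty,q^*})$, respectively.
Upon observing that $Z_{q^*}$ equals $Z_{q^*-1} + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)} W_{q^*}$, it is easy to see that (39) can be written as in (31).
By our assumptions, this expression is constant in $x = \gamma_{q^*}/(\sigma\xi_{\infty,q^*}) \in \mathbb{R}$. Lemma A.1 then entails that either
$C = 0$ or that $P(Z \le Cx) = 0$ for each $x \in \mathbb{R}$. Since $C$ equals $\sigma\xi_{\infty,q^*}^{-1} C_\infty^{(q^*)}$, it is non-zero by assumption.
Hence,

$$P\bigl(Z_{q^*-1} \le t + \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)} \gamma_{q^*}\bigr) = 0$$

must hold for every value of $\gamma_{q^*}$. But the above probability is just the conditional probability that $Z_{q^*} \le t$
given $W_{q^*} = -\gamma_{q^*}$. It follows that $P(Z_{q^*} \le t)$ equals zero as well. By our assumption $C_\infty^{(p)} = 0$ for $p > q^*$,
and hence $Z_{q^*} = Z_P$. We thus obtain $P(Z_P \le t) = 0$, a contradiction with the fact that $Z_P$ is a Gaussian
random variable on $\mathbb{R}^k$ with non-singular variance-covariance matrix $\sigma^2 AQ^{-1}A'$. □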

Inspection of the above proof shows that it can be simplified if the claim of non-constancy of $G_{\infty,\theta,\sigma,\gamma}(t)$
as a function of $\gamma \in M_{q^*} \backslash M_{q^*-1}$ in Lemma A.4 is weakened to non-constancy for $\gamma \in M_{q^*}$. The strong form
of the lemma as given here is needed in the proof of Proposition B.1.

Proof of Lemma A.5: Let $p_\circledcirc$ be the largest index $p$, $\mathcal{O} < p \le P$, for which $A[p]$ has a row of zeroes, and
set $p_\circledcirc = 0$ if no such index exists. We first show that $p_\circledcirc$ satisfies $p_\circledcirc < q^*$. Suppose $p_\circledcirc \ge q^*$ would hold. Since
$Z_{p_\circledcirc}$ is a Gaussian random vector with mean zero and variance-covariance matrix $\sigma^2 A[p_\circledcirc]Q[p_\circledcirc : p_\circledcirc]^{-1}A[p_\circledcirc]'$,
at least one component of $Z_{p_\circledcirc}$ is equal to zero with probability one. However, $Z_{p_\circledcirc}$ equals $Z_P$ because of
$p_\circledcirc \ge q^*$ and the definition of $q^*$. This leads to a contradiction since $Z_P$ has the non-singular
variance-covariance matrix $\sigma^2 AQ^{-1}A'$. Without loss of generality, we may hence assume that $p_\odot = p_\circledcirc$.

In view of the discussion in the first paragraph of the proof of Lemma A.4, it suffices to establish, for
each possible value $s$ in the range $\mathcal{O} \le s \le p_\odot$, the result (36) for some $\theta$ with $s = \max\{p_0(\theta), \mathcal{O}\} = p_*$. Now
fix such an $s$ and $\theta$ (as well as, of course, $t$, $Q$, $A$, $\sigma$, and the critical values $c_p$ for $\mathcal{O} < p \le P$). Then (37)
expresses the map $\gamma \mapsto G_{\infty,\theta,\sigma,\gamma}(t)$ in terms of $\nu = (\nu_1, \ldots, \nu_P)'$. It is easy to see that the correspondence
between $\gamma$ and $\nu$ is a linear bijection from $\mathbb{R}^P$ onto itself, and that $\gamma \in M_{q^*}$ if and only if $\nu \in M_{q^*}$. It is
hence sufficient to find a $\delta_* > 0$ and vectors $\nu$ and $\mu$ in $M_{q^*}$ such that (37) with $\nu + \varepsilon\mu$ in place of $\nu$ and
(37) with $\nu - \varepsilon\mu$ in place of $\nu$ differ by at least $\delta_*$ for sufficiently small $\varepsilon > 0$. Note that (37) is the sum of
$P - p_* + 1$ terms indexed by $p = p_*, \ldots, P$. We shall now show that $\nu$ and $\mu$ can be chosen in such a way
that, when replacing $\nu$ with $\nu + \varepsilon\mu$ and $\nu - \varepsilon\mu$, respectively, (i) the resulting terms in (37) corresponding to
$p = p_\odot$ differ by some $d > 0$, while (ii) the difference of the other terms becomes arbitrarily small, provided
that $\varepsilon > 0$ is sufficiently small.

Consider first the case where $s = p_* = p_\odot$. Using the shorthand notation

$$g(\nu) = P\Bigl(Z_{p_\odot} \le t + \sum_{r=p_\odot+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r\Bigr),$$

note that the $p_\odot$-th term in (37) is given by $g(\nu)$ multiplied by a product of positive probabilities which are
continuous in $\nu$. To prove property (i) it thus suffices to find a constant $c > 0$, and vectors $\nu$ and $\mu$ in $M_{q^*}$
such that $|g(\nu + \varepsilon\mu) - g(\nu - \varepsilon\mu)| \ge c$ holds for each sufficiently small $\varepsilon > 0$.

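In the sub-case $p_\odot = 0$ choose $c = 1$, set

$$\nu = -\bigl[C_\infty^{(1)}, \ldots, C_\infty^{(P)}\bigr]' \Bigl[\sum_{r=1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} C_\infty^{(r)\prime}\Bigr]^{-1} t$$

and

$$\mu = \bigl[C_\infty^{(1)}, \ldots, C_\infty^{(P)}\bigr]' \Bigl[\sum_{r=1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} C_\infty^{(r)\prime}\Bigr]^{-1} (1, \ldots, 1)',$$

observing that the matrix to be inverted is indeed non-singular, since - as discussed after Lemma A.5 - it
is up to a multiplicative factor $\sigma^2$ identical to the variance-covariance matrix $\sigma^2 AQ^{-1}A'$ of $Z_P$. But then
$\nu$ and $\mu$ satisfy $\sum_{r=p_\odot+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r = -t$ and $\sum_{r=p_\odot+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \mu_r = (1, \ldots, 1)'$ if we note that by the
definition of $q^*$

$$\sum_{r=p_\odot+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r = \sum_{r=1}^{P} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r$$

holds and that a similar relation holds with $\mu$ replacing $\nu$. Since $Z_{p_\odot} = Z_0 = 0 \in \mathbb{R}^k$, it is then obvious that
$g(\nu + \varepsilon\mu)$ and $g(\nu - \varepsilon\mu)$ differ by 1 for each $\varepsilon > 0$.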

In the other sub-case $p_\odot > 0$, define $M = [\xi_{\infty,1}^{-2} C_\infty^{(1)}, \ldots, \xi_{\infty,p_\odot}^{-2} C_\infty^{(p_\odot)}]$,
$N = [\xi_{\infty,p_\odot+1}^{-2} C_\infty^{(p_\odot+1)}, \ldots, \xi_{\infty,q^*}^{-2} C_\infty^{(q^*)}]$, and $V = (W_1, \ldots, W_{p_\odot})'$. It is then easy to see that $g(\nu)$ equals
$f((\nu_{p_\odot+1}, \ldots, \nu_{q^*})')$, with $f$ defined as in Lemma A.2, and that $M$ has a row of zeros. Furthermore, the
matrix $(M : N)$ has rank $k$ by the same argument as in the proof of Lemma A.4; cf. (40). By Lemma A.2,
we thus obtain vectors $x_0$ and $z$, and a $c > 0$ such that $|f(x_0 + \varepsilon z) - f(x_0 - \varepsilon z)| \ge c$ holds for each sufficiently
small $\varepsilon > 0$. Setting $(\nu_{p_\odot+1}, \ldots, \nu_{q^*})' = x_0$, $(\mu_{p_\odot+1}, \ldots, \mu_{q^*})' = z$, setting $\nu[\neg q^*]$ and $\mu[\neg q^*]$ each equal to
zero, and setting $\nu[p_\odot]$ and $\mu[p_\odot]$ to arbitrary values, we see that $g(\nu \pm \varepsilon\mu)$ has the desired properties.

To complete the proof in case $s = p_* = p_\odot$, we need to establish property (ii) for which it suffices to
show that, for $p > p_\odot$, the $p$-th term in (37) depends continuously on $\nu$. For $p > q^*$, the $p$-th term does not
depend on $\nu$, because $C_\infty^{(r)} = 0$ for $r = q^* + 1, \ldots, P$. For $p$ satisfying $p_\odot < p \le q^*$, it suffices to show that

$$h(\nu_p, \ldots, \nu_{q^*}) = P\Bigl(Z_p \le t + \sum_{r=p+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r,\ |W_p + \nu_p| \ge c_p\sigma\xi_{\infty,p}\Bigr)$$

is a continuous function. Suppose that $(\nu_p^{(m)}, \ldots, \nu_{q^*}^{(m)})$ converges to $(\nu_p, \ldots, \nu_{q^*})$ as $m \to \infty$. For arbitrary
$\alpha > 0$, $\sum_{r=p+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r^{(m)}$ and $\sum_{r=p+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r$ differ by less than $\alpha$ in each coordinate, provided that
$m$ is sufficiently large. This implies

$$\begin{aligned}
\limsup_{m \to \infty} h(\nu_p^{(m)}, \ldots, \nu_{q^*}^{(m)})
&\le \limsup_{m \to \infty} P\Bigl(Z_p \le t + \sum_{r=p+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r + \alpha(1, \ldots, 1)',\ |W_p + \nu_p^{(m)}| \ge c_p\sigma\xi_{\infty,p}\Bigr) \\
&= P\Bigl(Z_p \le t + \sum_{r=p+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r + \alpha(1, \ldots, 1)',\ |W_p + \nu_p| \ge c_p\sigma\xi_{\infty,p}\Bigr),
\end{aligned}$$

observing that the latter probability is obviously continuous in the single variable $\nu_p$ (since $W_p$ has an
absolutely continuous distribution). Letting $\alpha$ decrease to zero we obtain $\limsup_{m \to \infty} h(\nu_p^{(m)}, \ldots, \nu_{q^*}^{(m)}) \le
h(\nu_p, \ldots, \nu_{q^*})$. A similar argument establishes $\liminf_{m \to \infty} h(\nu_p^{(m)}, \ldots, \nu_{q^*}^{(m)}) \ge P(Z_p < t +
\sum_{r=p+1}^{q^*} \xi_{\infty,r}^{-2} C_\infty^{(r)} \nu_r,\ |W_p + \nu_p| \ge c_p\sigma\xi_{\infty,p})$. The proof of the continuity of $h$ is then complete if we can show
that $P(Z_p \le \cdot,\ |W_p + \nu_p| \ge c_p\sigma\xi_{\infty,p})$ is continuous or, equivalently, that $P(Z_p \le \cdot \mid |W_p + \nu_p| \ge c_p\sigma\xi_{\infty,p})$
is a continuous cdf. Since $p > p_\odot$, the variance-covariance matrix $\sigma^2 A[p]Q[p:p]^{-1}A[p]'$ of $Z_p$ does only
have non-zero diagonal elements. Consequently, when representing $Z_p$ as $B(W_1, \ldots, W_p)'$, the matrix $B$
cannot have rows that consist entirely of zeros. The conditional distribution of $(W_1, \ldots, W_p)'$ given the
event $\{|W_p + \nu_p| \ge c_p\sigma\xi_{\infty,p}\}$ is clearly absolutely continuous w.r.t. $p$-dimensional Lebesgue measure. But
then Lemma A.3 delivers the desired result.

The case where $s = p_* < p_\odot$ is reduced to the previously discussed case as follows: It is easy to see that,
for $\nu_{p_\odot} \to \infty$, the expression in (37) converges to a limit uniformly w.r.t. all $\nu_p$ with $p \ne p_\odot$. Then observe
that this limit is again of the form (37) but now with $p_\odot$ taking the role of $p_*$. □



B Non-Uniformity of the Convergence of the Finite-Sample Cdf 
to the Large-Sample Limit 

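Proposition B.1 a. Suppose that $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are asymptotically correlated, i.e., $C_\infty^{(q)} \ne 0$, for some
$q$ satisfying $\mathcal{O} < q \le P$, and let $q^*$ denote the largest $q$ with this property. Then for every $\theta \in M_{q^*-1}$,
every $\sigma$, $0 < \sigma < \infty$, and every $t \in \mathbb{R}^k$ there exists a $\rho$, $0 < \rho < \infty$, such that

$$\liminf_{n \to \infty} \sup_{\substack{\vartheta \in M_{q^*} \\ \|\vartheta - \theta\| < \rho/\sqrt{n}}} \bigl|G_{n,\vartheta,\sigma}(t) - G_{\infty,\vartheta,\sigma}(t)\bigr| > 0 \tag{41}$$

holds. The constant $\rho$ may be chosen in such a way that it depends only on $t$, $Q$, $A$, $\sigma$, and the critical
values $c_p$ for $\mathcal{O} < p \le P$.

b. Suppose that $A\tilde\theta(q)$ and $\tilde\theta_q(q)$ are asymptotically uncorrelated, i.e., $C_\infty^{(q)} = 0$, for all $q$ satisfying $\mathcal{O} <
q \le P$. Then $G_{n,\theta,\sigma}$ converges to $\Phi_{\infty,P}$ in total variation uniformly in $\theta \in \mathbb{R}^P$; more precisely

$$\sup_{\theta \in \mathbb{R}^P}\ \sup_{\sigma_* \le \sigma \le \sigma^*} \bigl\|G_{n,\theta,\sigma} - \Phi_{\infty,P}\bigr\|_{TV} \xrightarrow{\,n \to \infty\,} 0$$

holds for any constants $\sigma_*$ and $\sigma^*$ satisfying $0 < \sigma_* \le \sigma^* < \infty$.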

Under the assumptions of Proposition B.1(a), we see that convergence of $G_{n,\theta,\sigma}(t)$ to $G_{\infty,\theta,\sigma}(t)$ is
non-uniform over shrinking 'tubes' around $M_{q^*-1}$ that are contained in $M_{q^*}$. [On the complement of a tube with
a fixed positive radius, i.e., on the set $U = \{\theta \in \mathbb{R}^P : |\theta_{q^*}| \ge r\}$ with fixed $r > 0$, convergence of $G_{n,\theta,\sigma}(t)$
to $G_{\infty,\theta,\sigma}(t)$ is in fact uniform (even with respect to the total variation distance), as can be shown. Note
that for $\theta \in U$ the cdf $G_{\infty,\theta,\sigma}(t)$ reduces to the Gaussian cdf $\Phi_{\infty,P}(t)$, i.e., to the asymptotic distribution of
the least-squares estimator based on the overall model; cf. Remark A.6.] A precursor to Proposition B.1(a)
is Corollary 5.5 of Leeb and Potscher (2003), which establishes (41) in a special case in which, in particular,
$A$ is the $P \times P$ identity matrix. Proposition B.1(b) describes an exceptional case where convergence
is uniform. [In this case $G_{\infty,\theta,\sigma}$ reduces to the Gaussian cdf $\Phi_{\infty,P}$ for all $\theta$ and $\Phi_{\infty,p} = \Phi_{\infty,P}$, $\mathcal{O} \le p \le P$,
holds; cf. Remark A.6.] Recall that under the assumptions of part (b) of Proposition B.1 we necessarily
always have (i) $\mathcal{O} > 0$, and (ii) rank $A[\mathcal{O}] = k$; cf. Proposition 4.4 in Leeb and Potscher (2006b).

Proof of Proposition B.1: We first prove part (a). As noted at the beginning of the proof of Lemma
A.4, the map $\gamma \mapsto G_{\infty,\theta,\sigma,\gamma}(t)$ depends only on $t$, $Q$, $A$, $\sigma$, the critical values $c_p$ for $\mathcal{O} < p \le P$, as well as on
$\theta$, but the dependence on $\theta$ is only through $p_* = \max\{p_0(\theta), \mathcal{O}\}$. It hence suffices to find, for each possible
value of $p_*$ in the range $p_* = \mathcal{O}, \ldots, q^* - 1$, a constant $0 < \rho < \infty$ such that (41) is satisfied for some (and
hence all) $\theta$ returning this particular value of $p_* = \max\{p_0(\theta), \mathcal{O}\}$. For this in turn it is sufficient to show
that given such a $\theta$ we can find a $\gamma \in M_{q^*}$ such that

$$\liminf_{n \to \infty} \bigl|G_{n,\theta+\gamma/\sqrt{n},\sigma}(t) - G_{\infty,\theta+\gamma/\sqrt{n},\sigma}(t)\bigr| > 0 \tag{42}$$

holds. Note that (42) is equivalent to

$$\liminf_{n \to \infty} \bigl|G_{\infty,\theta,\sigma,\gamma}(t) - G_{\infty,\theta+\gamma/\sqrt{n},\sigma}(t)\bigr| > 0 \tag{43}$$

in light of Proposition 2.1. To establish (43), we proceed as follows: For each $\gamma \in M_{q^*}$ with $\gamma_{q^*} \ne 0$,
$G_{\infty,\theta+\gamma/\sqrt{n},\sigma}(t)$ in (15) reduces to $\Phi_{\infty,q^*}(t)$ as is easily seen from (37) since $p_0(\theta + \gamma/\sqrt{n}) = q^*$ which in turn
follows from $p_0(\theta) < q^*$ and $\gamma_{q^*} \ne 0$. Furthermore, Lemma A.4 entails that $G_{\infty,\theta,\sigma,\gamma}(t)$ is non-constant in
$\gamma \in M_{q^*} \backslash M_{q^*-1}$. But this shows that (43) must hold.
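
The non-uniformity in part (a) can be checked numerically in a toy model. The sketch below is an illustration under assumptions that are not part of the paper's setup: a Gaussian location model with known unit variance, where the post-model-selection estimator equals the sample mean if $|\sqrt{n}\,\bar{y}| \ge c$ and equals zero otherwise. In this toy model the finite-sample cdf of $\sqrt{n}(\hat\theta - \theta)$ has the closed form coded in `finite_sample_cdf` (a hypothetical name), and its pointwise limit for fixed $\theta \ne 0$ is $\Phi(t)$. The maximal gap over a shrinking neighborhood $0 < |\theta| \le \rho/\sqrt{n}$ does not decrease with $n$.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def finite_sample_cdf(t: float, theta: float, n: int, c: float) -> float:
    """cdf at t of sqrt(n) * (estimator - theta) in the toy location model.

    With Z = sqrt(n) * (ybar - theta) ~ N(0, 1) and nu = sqrt(n) * theta:
    the estimator is 0 when |Z + nu| < c (error equals -nu), else ybar (error Z).
    """
    nu = math.sqrt(n) * theta
    prob_restricted = std_normal_cdf(c - nu) - std_normal_cdf(-c - nu)
    first = (1.0 if t + nu >= 0.0 else 0.0) * prob_restricted
    upper_branch = max(0.0, std_normal_cdf(t) - std_normal_cdf(c - nu))
    lower_branch = std_normal_cdf(min(t, -c - nu))
    return first + upper_branch + lower_branch


def sup_gap(n: int, t: float, c: float, rho: float, grid: int = 200) -> float:
    """Max over a grid of 0 < |theta| <= rho / sqrt(n) of |G_n(t) - Phi(t)|."""
    limit = std_normal_cdf(t)  # pointwise limit for every fixed theta != 0
    gap = 0.0
    for j in range(1, grid + 1):
        theta = rho / math.sqrt(n) * j / grid
        for signed_theta in (theta, -theta):
            gap = max(gap, abs(finite_sample_cdf(t, signed_theta, n, c) - limit))
    return gap


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, round(sup_gap(n, t=0.0, c=2.0, rho=3.0), 6))
```

Because the toy cdf depends on $(n, \theta)$ only through $\sqrt{n}\,\theta$, the printed gap is essentially the same for every $n$, which mirrors the strict positivity of the liminf in (41).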

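To prove part (b), we write

$$\bigl\|G_{n,\theta,\sigma} - \Phi_{\infty,P}\bigr\|_{TV} = \Bigl\|\sum_{p=\mathcal{O}}^{P} G_{n,\theta,\sigma}(\cdot|p)\,\pi_{n,\theta,\sigma}(p) - \Phi_{\infty,P}(\cdot)\Bigr\|_{TV} \le \sum_{p=\mathcal{O}}^{P} \bigl\|G_{n,\theta,\sigma}(\cdot|p) - \Phi_{\infty,P}(\cdot)\bigr\|_{TV}\,\pi_{n,\theta,\sigma}(p)$$

where the conditional cdfs $G_{n,\theta,\sigma}(\cdot|p)$ and the model selection probabilities $\pi_{n,\theta,\sigma}(p)$ have been introduced
after (12). By the 'uncorrelatedness' assumption, we have that $\Phi_{\infty,p} = \Phi_{\infty,P}$ for all $p$ in the range $\mathcal{O} \le p \le P$;
cf. Remark A.6. We hence obtain

$$\sup_{\theta \in \mathbb{R}^P}\ \sup_{\sigma_* \le \sigma \le \sigma^*} \bigl\|G_{n,\theta,\sigma} - \Phi_{\infty,P}\bigr\|_{TV} \le \sum_{p=\mathcal{O}}^{P}\ \sup_{\theta \in \mathbb{R}^P}\ \sup_{\sigma_* \le \sigma \le \sigma^*} \bigl\|G_{n,\theta,\sigma}(\cdot|p) - \Phi_{\infty,p}(\cdot)\bigr\|_{TV}\,\pi_{n,\theta,\sigma}(p). \tag{44}$$

Now for every $p$ with $\mathcal{O} \le p \le P$ and for every $\rho$, $0 < \rho < \infty$, we can write

$$\begin{aligned}
& \sup_{\theta \in \mathbb{R}^P}\ \sup_{\sigma_* \le \sigma \le \sigma^*} \bigl\|G_{n,\theta,\sigma}(\cdot|p) - \Phi_{\infty,p}(\cdot)\bigr\|_{TV}\,\pi_{n,\theta,\sigma}(p) \\
&\quad \le \max\Bigl\{ \sup_{\|\theta[\neg p]\| < \rho/\sqrt{n}}\ \sup_{\sigma_* \le \sigma \le \sigma^*} \bigl\|G_{n,\theta,\sigma}(\cdot|p) - \Phi_{\infty,p}(\cdot)\bigr\|_{TV},\ \sup_{\|\theta[\neg p]\| \ge \rho/\sqrt{n}}\ \sup_{\sigma_* \le \sigma \le \sigma^*} \pi_{n,\theta,\sigma}(p) \Bigr\}.
\end{aligned} \tag{45}$$

In case $p = P$, we use here the convention that the second term in the maximum is absent and that the first
supremum in the first term in the maximum extends over all of $M_P$. Letting first $n$ and then $\rho$ go to infinity
in (45), we may apply Lemmas C.2 and C.3 in Leeb and Potscher (2005b) to conclude that the l.h.s. of (45),
and hence the l.h.s. of (44), goes to zero as $n \to \infty$. □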



C Proofs for Sections 2.1 to 2.2.2 

In the proofs below it will be convenient to show the dependence of $\Phi_{n,p}$ and $\Phi_{\infty,p}$ on $\sigma$ in the notation. Thus,
in the following we shall write $\Phi_{n,p,\sigma}$ and $\Phi_{\infty,p,\sigma}$, respectively, for the cdf of a $k$-variate Gaussian random
vector with mean zero and variance-covariance matrix $\sigma^2 A[p](X[p]'X[p]/n)^{-1}A[p]'$ and $\sigma^2 A[p]Q[p:p]^{-1}A[p]'$,
respectively. For convenience, let $\Phi_{n,0,\sigma}$ and $\Phi_{\infty,0,\sigma}$ denote the cdf of point-mass at zero in $\mathbb{R}^k$.

The following lemma is elementary to prove, if we recall that $b_{n,p}z$ converges to $b_{\infty,p}z$ as $n \to \infty$ for
every $z \in \mathrm{Im}\,A[p]$, the column space of $A[p]$.

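Lemma C.1 Suppose $p > \mathcal{O}$. Define $R_{n,p}(z, \sigma) = 1 - \Delta_{\sigma\zeta_{n,p}}(b_{n,p}z, c_p\sigma\xi_{n,p})$ and $R_{\infty,p}(z, \sigma) = 1 -
\Delta_{\sigma\zeta_{\infty,p}}(b_{\infty,p}z, c_p\sigma\xi_{\infty,p})$ for $z \in \mathrm{Im}\,A[p]$, $0 < \sigma < \infty$. Let $\sigma^{(n)}$ converge to $\sigma$, $0 < \sigma < \infty$. If $\zeta_{\infty,p} \ne 0$, then
$R_{n,p}(z, \sigma^{(n)})$ converges to $R_{\infty,p}(z, \sigma)$ for every $z \in \mathrm{Im}\,A[p]$; if $\zeta_{\infty,p} = 0$, then convergence holds for every
$z \in \mathrm{Im}\,A[p]$, except possibly for $z \in \mathrm{Im}\,A[p]$ satisfying $|b_{\infty,p}z| = c_p\sigma\xi_{\infty,p}$. [This exceptional subset of $\mathrm{Im}\,A[p]$
has rank($A[p]$)-dimensional Lebesgue measure zero since $c_p\sigma\xi_{\infty,p} > 0$.]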
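
The role of the exceptional set in Lemma C.1 can be illustrated numerically. The sketch below assumes that $\Delta_s(a, b) = \Phi((a+b)/s) - \Phi((a-b)/s)$ for $s > 0$, which is consistent with how the function is used in (37), and it assumes the convention $\Delta_0(a, b) = 1\{|a| < b\}$ for the degenerate case; the definition itself is given earlier in the paper and is not restated in this appendix, so both formulas should be read as assumptions. The function name `delta` is hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def delta(s: float, a: float, b: float) -> float:
    """Assumed form of Delta_s(a, b): P(|N(a, s^2)| < b) for s > 0.

    For s = 0 the assumed convention is the indicator of |a| < b.
    """
    if s == 0.0:
        return 1.0 if abs(a) < b else 0.0
    return std_normal_cdf((a + b) / s) - std_normal_cdf((a - b) / s)


if __name__ == "__main__":
    b = 1.0
    for a in (0.5, 1.0, 1.5):
        along_sequence = [round(delta(s, a, b), 4) for s in (1.0, 0.1, 1e-3, 1e-6)]
        print(a, along_sequence, "limit object:", delta(0.0, a, b))
```

Away from the boundary $|a| = b$ the values along $s \to 0$ approach the degenerate value, whereas on the boundary they approach $1/2$; this is the kind of failure of convergence that the lemma excludes by removing the set $|b_{\infty,p}z| = c_p\sigma\xi_{\infty,p}$.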

The following observation is useful in the proof of Proposition 2.2 below: Since the proposition depends
on $Y$ only through its distribution (cf. Remark 4.1), we may assume without loss of generality that the
errors in (5) are given by $u_t = \sigma\varepsilon_t$, $t \in \mathbb{N}$, with i.i.d. $\varepsilon_t$ that are standard normal. In particular, all random
variables involved are then defined on the same probability space.

Proof of Proposition 2.2: Since P n ,e,a(P = Po{6)) — *• 1 by consistency, we may replace max{p, O} 
by P* = max{p o (60, 0} in the formula for G n for the remainder of the proof. Furthermore, since a — > a 
in P n) e i(T -probability, each subsequence contains a further subsequence along which a — > a almost surely 
(with respect to the probability measure on the common probability space supporting all random variables 
involved), and we restrict ourselves to such a further subsequence for the moment. In particular, we write 
{<7 — > a} for the event that a converges to a along the subsequence under consideration; clearly, the event 
{<7 — > a} has probability one. Also note that we can assume without loss of generality that a > holds 
on this event (at least from some data-dependent n onwards), since a > holds. But then obviously 

n£= P „+i A ^ n , q (°' c <i^n, q ) converges to l\q= P ,+i A ^oo, q ( ' c 9 CT ^oo, g )- and ^n, P »(*) converges to $00^,0-^) 
in total variation by Lemma A. 3 of Leeb (2005) in case > 0, and trivially so in case = 0. This proves 
that the first term in the formula for G n converges to the corresponding term in the formula for Goo, 8,a in 
total variation. 

Next, consider the term in $\check G_n$ that carries the index $p > p_*$. By Lemma A.3 in Leeb (2005), $\hat\Phi_{n,p} = \Phi_{n,p,\hat\sigma}$
has a density $d\Phi_{n,p,\hat\sigma}/d\Phi_{\infty,p,\sigma}$ with respect to $\Phi_{\infty,p,\sigma}$, which converges to 1 except on a set that has measure
zero under $\Phi_{\infty,p,\sigma}$. By Scheffé's Lemma (Billingsley (1995), Theorem 16.12), $d\Phi_{n,p,\hat\sigma}/d\Phi_{\infty,p,\sigma}$ converges to 1
also in the $L^1(\Phi_{\infty,p,\sigma})$-sense. By Lemma C.1, $R_{n,p}(z,\hat\sigma)$ converges to $R_{\infty,p}(z,\sigma)$ except possibly on a set that
has measure zero under $\Phi_{\infty,p,\sigma}$. (Recall that $\Phi_{\infty,p,\sigma}$ is concentrated on $\operatorname{Im} A[p]$ and is not degenerate there.)
Observing that $|R_{n,p}(z,\hat\sigma)|$ is uniformly bounded by 1, we obtain that $R_{n,p}(z,\hat\sigma)$ converges to $R_{\infty,p}(z,\sigma)$
also in the $L^1(\Phi_{\infty,p,\sigma})$-sense. Hence,



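\[
\begin{aligned}
\Bigl\| R_{n,p}(z,\hat\sigma)\,\frac{d\Phi_{n,p,\hat\sigma}}{d\Phi_{\infty,p,\sigma}}(z) - R_{\infty,p}(z,\sigma) \Bigr\|
&\le \Bigl\| R_{n,p}(z,\hat\sigma)\,\frac{d\Phi_{n,p,\hat\sigma}}{d\Phi_{\infty,p,\sigma}}(z) - R_{n,p}(z,\hat\sigma) \Bigr\|
 + \bigl\| R_{n,p}(z,\hat\sigma) - R_{\infty,p}(z,\sigma) \bigr\| \\
&\le \Bigl\| \frac{d\Phi_{n,p,\hat\sigma}}{d\Phi_{\infty,p,\sigma}}(z) - 1 \Bigr\|
 + \bigl\| R_{n,p}(z,\hat\sigma) - R_{\infty,p}(z,\sigma) \bigr\|,
\end{aligned}
\tag{46}
\]

where $\|\cdot\|$ denotes the $L^1(\Phi_{\infty,p,\sigma})$-norm. Since $\prod_{q=p+1}^{P} \Delta_{\hat\sigma \xi_{n,q}}(0, c_q \hat\sigma \xi_{n,q})$ obviously converges to
$\prod_{q=p+1}^{P} \Delta_{\sigma \xi_{\infty,q}}(0, c_q \sigma \xi_{\infty,q})$, the relation (46) shows that the term in $\check G_n$ carrying the index $p$ converges
to the corresponding term in $G_{\infty,\theta,\sigma}$ in the total variation sense. This proves (18) along the subsequence
under consideration. However, since any subsequence contains such a further subsequence, this establishes
(18). Since $G_{n,\theta,\sigma}$ converges to $G_{\infty,\theta,\sigma}$ in total variation by Proposition 2.1, the claim in (17) also follows. □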
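
Before we prove the main result we observe that the total variation distance between $P_{n,\theta,\sigma}$ and
$P_{n,\vartheta,\sigma}$ satisfies $\|P_{n,\theta,\sigma} - P_{n,\vartheta,\sigma}\|_{TV} \le 2\Phi\bigl(\|\theta - \vartheta\|\,\lambda_{\max}^{1/2}(X'X)/(2\sigma)\bigr) - 1$; furthermore, if $\theta^{(n)}$ and $\vartheta^{(n)}$ satisfy
$\|\theta^{(n)} - \vartheta^{(n)}\| = O(n^{-1/2})$, the sequence $P_{n,\vartheta^{(n)},\sigma}$ is contiguous with respect to the sequence $P_{n,\theta^{(n)},\sigma}$
(and vice versa). This follows exactly in the same way as Lemma A.1 in Leeb and Pötscher (2006a).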
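
The total variation bound can be checked numerically in a small example. The sketch below is only an illustration under stated assumptions: the errors are i.i.d. $N(0,\sigma^2)$, total variation is taken as the supremum over events of $|P(A) - Q(A)|$ (so that the distance between $N(X\theta,\sigma^2 I)$ and $N(X\vartheta,\sigma^2 I)$ equals $2\Phi(\|X(\theta-\vartheta)\|/(2\sigma)) - 1$), and the two-regressor design with rows $(1, \cos t)$ is a hypothetical choice. With $\vartheta - \theta = \gamma/\sqrt{n}$, the bound stays away from 1 as $n$ grows, which is the behaviour behind the contiguity statement.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal cdf Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def design(n: int):
    """Hypothetical n x 2 design matrix with rows (1, cos t), t = 1..n."""
    return [(1.0, math.cos(t)) for t in range(1, n + 1)]

def lambda_max(X) -> float:
    """Largest eigenvalue of the 2 x 2 matrix X'X (closed form for 2 x 2)."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    return 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)

def tv_exact(X, diff, sigma: float) -> float:
    """Exact TV distance between N(X theta, s^2 I) and N(X vartheta, s^2 I)."""
    shift = math.sqrt(sum((r[0] * diff[0] + r[1] * diff[1]) ** 2 for r in X))
    return 2.0 * std_normal_cdf(shift / (2.0 * sigma)) - 1.0

def tv_bound(X, diff, sigma: float) -> float:
    """Upper bound 2*Phi(||theta - vartheta|| * lambda_max^{1/2}(X'X) / (2 sigma)) - 1."""
    dist = math.hypot(diff[0], diff[1])
    return 2.0 * std_normal_cdf(dist * math.sqrt(lambda_max(X)) / (2.0 * sigma)) - 1.0

sigma = 1.0
gamma = (0.5, 1.0)  # local direction; vartheta - theta = gamma / sqrt(n)
results = []
for n in (50, 500, 5000):
    X = design(n)
    diff = (gamma[0] / math.sqrt(n), gamma[1] / math.sqrt(n))
    results.append((n, tv_exact(X, diff, sigma), tv_bound(X, diff, sigma)))

for n, exact, bound in results:
    print(f"n={n:5d}  exact TV={exact:.4f}  bound={bound:.4f}")
```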

Proof of Theorem 2.3: We first prove (20) and (21). For this purpose we make use of Lemma 3.1 in
Leeb and Pötscher (2006a) with $\alpha = \theta \in M_{q^*-1}$, $B = M_{q^*}$, $B_n = \{\vartheta \in M_{q^*} : \|\vartheta - \theta\| < \rho_0 n^{-1/2}\}$, $\beta = \vartheta$,
$\varphi_n(\beta) = G_{n,\vartheta,\sigma}(t)$, $\hat\varphi_n = \hat G_n(t)$, where $\rho_0$, $0 < \rho_0 < \infty$, will be chosen shortly (and $\sigma$ is held fixed). The
contiguity assumption of this lemma (as well as the mutual contiguity assumption used in the corrigendum
to Leeb and Pötscher (2006a)) is satisfied in view of the preparatory remark above. It hence remains only to
show that there exists a value of $\rho_0$, $0 < \rho_0 < \infty$, such that $\delta^*$ in Lemma 3.1 of Leeb and Pötscher (2006a)
(which represents the limit inferior of the oscillation of $\varphi_n(\cdot)$ over $B_n$) is positive. Applying Lemma 3.5(i)
of Leeb and Pötscher (2006a) with $\zeta_n = \rho_0 n^{-1/2}$ and the set $G_0$ equal to the set $G$, it remains, in light
of Proposition 2.1, to show that there exists a $\rho_0$, $0 < \rho_0 < \infty$, such that $G_{\infty,\theta,\sigma,\gamma}(t)$ as a function of $\gamma$ is
non-constant on the set $\{\gamma \in M_{q^*} : \|\gamma\| < \rho_0\}$. In view of Lemma 3.1 of Leeb and Pötscher (2006a), the
corresponding $\delta_0$ can then be chosen as any positive number less than one-half of the oscillation of $G_{\infty,\theta,\sigma,\gamma}(t)$
over this set. That such a $\rho_0$ indeed exists follows now from Lemma A.4 in Appendix A, where it is also
shown that $\rho_0$ and $\delta_0$ can be chosen such that they depend only on $t$, $Q$, $A$, $\sigma$, and $c_p$ for $O < p \le P$. This
completes the proof of (20) and (21).
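
The role of one-half of the oscillation can be illustrated with a deterministic toy computation. The function `g` below is a hypothetical stand-in for $\gamma \mapsto G_{\infty,\theta,\sigma,\gamma}(t)$ (it is not the limit cdf of the paper), and the radius and grids are arbitrary choices. The sketch only shows the elementary fact that a single guess for the value of a non-constant function on a set must miss at least one value by half of the oscillation; the probabilistic lower bounds in (20) and (21) rest on the cited lemmas, not on this computation.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal cdf Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def g(gamma: float) -> float:
    """Hypothetical non-constant function of a scalar local parameter gamma."""
    return std_normal_cdf(0.5 - gamma)

def oscillation(f, points) -> float:
    """sup f - inf f over a finite set of points."""
    values = [f(p) for p in points]
    return max(values) - min(values)

def worst_case_error(f, points, guess: float) -> float:
    """Largest absolute error of one fixed guess over all points."""
    return max(abs(f(p) - guess) for p in points)

rho = 1.0  # radius of the local neighbourhood (illustrative)
grid = [-rho + 2.0 * rho * i / 200 for i in range(201)]
osc = oscillation(g, grid)

# Search over candidate guesses in [0, 1] for the smallest worst-case error.
candidates = [i / 1000 for i in range(1001)]
best = min(worst_case_error(g, grid, e) for e in candidates)

print(f"oscillation = {osc:.4f}, best worst-case error = {best:.4f}")
```

No guess achieves a worst-case error below `osc / 2`, so any threshold $\delta_0$ smaller than that is exceeded somewhere on the set.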

To prove (22) we use Corollary 3.4 in Leeb and Pötscher (2006a) with the same identification of
notation as above, with $\zeta_n = \rho_0 n^{-1/2}$, and with $V = M_{q^*}$ (viewed as a vector space isomorphic to
$\mathbb{R}^{q^*}$). The asymptotic uniform equicontinuity condition in that corollary is then satisfied in view of
$\|P_{n,\theta,\sigma} - P_{n,\vartheta,\sigma}\|_{TV} \le 2\Phi\bigl(\|\theta - \vartheta\|\,\lambda_{\max}^{1/2}(X'X)/(2\sigma)\bigr) - 1$. Given that the positivity of $\delta^*$ has already been
established in the previous paragraph, applying Corollary 3.4(i) in Leeb and Pötscher (2006a) then establishes
(22). □

Proof of Remark 2.4: The proof is similar to the proof of (22) just given, except for using
Corollary 3.4(ii) and Lemma 3.5(ii) in Leeb and Pötscher (2006a) instead of Corollary 3.4(i) and Lemma 3.5(i)
from that paper. Furthermore, Lemma A.5 in Appendix A instead of Lemma A.4 is used. □

Proof of Proposition 2.5: In view of Proposition B.1(b) and the fact that $\hat\Phi_{n,p}(\cdot) = \Phi_{n,p,\hat\sigma}(\cdot)$ holds
(in case $\hat\sigma > 0$), it suffices to show that

\[
\sup_{\sigma_* \le \sigma \le \sigma^*} \bigl\| \Phi_{n,p,\sigma}(\cdot) - \Phi_{\infty,p,\sigma}(\cdot) \bigr\|_{TV} \to 0 \tag{47}
\]

\[
\sup_{\sigma_* \le \sigma \le \sigma^*} P_{n,\theta,\sigma}\bigl( \| \Phi_{n,p,\hat\sigma}(\cdot) - \Phi_{n,p,\sigma}(\cdot) \|_{TV} > \delta \bigr) \to 0 \tag{48}
\]

hold for each $\delta > 0$, and for any constants $\sigma_*$ and $\sigma^*$ satisfying $0 < \sigma_* \le \sigma^* < \infty$. [Note that the probability
in (48) does in fact not depend on $\theta$.] But this has already been established in the proof of Proposition 4.3
of Leeb and Pötscher (2005b). □



D Proofs for Section 3 

Proof of Theorem 3.1: After rearranging the elements of $\theta$ (and hence the regressors) if necessary and
then correspondingly rearranging the rows of the matrix $A$, we may assume without loss of generality that
$\mathfrak{r}_* = (1, \ldots, 1, 0)$, and hence that $i(\mathfrak{r}_*) = P$. That is, $M_{\mathfrak{r}_*} = M_{P-1}$ and $M_{\mathfrak{r}_{full}} = M_P$. Furthermore, note
that after this arrangement $C_\infty^{(P)} \neq 0$. Let $\hat p$ be the model selection procedure introduced in Section 2 with
$O = P - 1$, $c_P = c$, and $c_O = 0$. Let $\tilde\theta$ be the corresponding post-model-selection estimator and let $G_{n,\theta,\sigma}(t)$
be as defined in Section 2.1. Condition (24) now implies: For every $\theta \in M_{P-1}$ which has exactly $P - 1$
non-zero coordinates

\[
\lim_{n \to \infty} P_{n,\theta,\sigma}\bigl( \{\hat{\mathfrak{r}} = \mathfrak{r}_{full}\} \,\triangle\, \{\hat p = P\} \bigr)
= \lim_{n \to \infty} P_{n,\theta,\sigma}\bigl( \{\hat{\mathfrak{r}} = \mathfrak{r}_*\} \,\triangle\, \{\hat p = P - 1\} \bigr) = 0 \tag{49}
\]

holds for every $0 < \sigma < \infty$. Since the sequences $P_{n,\vartheta^{(n)},\sigma}$ and $P_{n,\theta,\sigma}$ are contiguous for $\vartheta^{(n)}$ satisfying
$\|\vartheta^{(n)} - \theta\| = O(n^{-1/2})$, as remarked prior to the proof of Theorem 2.3 in Appendix C, it follows that
condition (49) continues to hold with $P_{n,\vartheta^{(n)},\sigma}$ replacing $P_{n,\theta,\sigma}$. This implies that for every sequence of
positive real numbers $s_n$ with $s_n = O(n^{-1/2})$, for every $\sigma$, $0 < \sigma < \infty$, and for every $\theta \in M_{P-1}$ which has
exactly $P - 1$ non-zero coordinates

\[
\sup_{\|\vartheta - \theta\| < s_n} \bigl\| K_{n,\vartheta,\sigma} - G_{n,\vartheta,\sigma} \bigr\|_{TV} \to 0 \tag{50}
\]

holds as $n \to \infty$. From (50) we conclude that the limit of $K_{n,\theta+\gamma/\sqrt{n},\sigma}$ (with respect to total variation
distance) exists and coincides with $G_{\infty,\theta,\sigma,\gamma}$. Repeating the proof of Theorem 2.3 with $q^* = P$, with
$K_{n,\vartheta,\sigma}(t)$ replacing $G_{n,\vartheta,\sigma}(t)$, and with $\hat K_n(t)$ replacing $\hat G_n(t)$ gives the desired result. □